# Deep Learning & Applied AI

We recommend going through this notebook using Google Colaboratory.

# Tutorial 7: Uncertainty, regularization and the deep learning toolset

In this tutorial, we will cover:

- Uncertainty in deep learning
- Dropout and Batch normalization
- Deep Learning tools and code best practices

Our info:

- Luca Moschella (moschella@di.uniroma1.it)
- Antonio Norelli (norelli@di.uniroma1.it)

Course:

- Website and notebooks will be available at [DLAI-s2-2020](https://erodola.github.io/DLAI-s2-2020/)

## Import dependencies (run the following cells)

In [2]:
# @title import dependencies
from __future__ import print_function, division
from typing import Mapping, Union, Optional

import numpy as np
import argparse
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import plotly.graph_objects as go
import plotly.express as px
import torchvision
from torchvision import datasets, models, transforms

import os
import pickle
from tqdm.notebook import tqdm



In [3]:
# @title reproducibility stuff

import random
torch.manual_seed(42)
np.random.seed(42)
random.seed(0)

torch.cuda.manual_seed(0)
torch.backends.cudnn.deterministic = True  # Note that this Deterministic mode can have a performance impact
torch.backends.cudnn.benchmark = False

## Uncertainty in deep learning and two popular regularization techniques

In this section we will briefly discuss *uncertainty* in deep learning, an inescapable concept whenever we aim to extrapolate a general rule from finite data.

We will also experiment with two popular *regularization* methods, dropout and batch normalization.

Surprisingly enough, these two topics fit well in the same section, since a very simple and effective way to model uncertainty in deep learning is through dropout.




#### Uncertainty in deep learning

> "*In almost all circumstances, and at all times,
we find ourselves in a state of uncertainty.*
>
>*Uncertainty in every sense.*
>
>*Uncertainty about actual situations, past and present (this might stem from either a lack of knowledge and information, or from the incompleteness and unreliability of the information at our disposal, either ours or someone else's, to provide a convincing recollection of these situations.) [...]*
>
>*Uncertainty in the face of decisions: more than ever in this case, compounded by the fact that decisions have to be based on knowledge of the actual situation, which is itself uncertain, to be guided by the prevision of uncontrollable events, and to aim for certain desirable effects of the decisions themselves, these also being uncertain.*"

> Bruno de Finetti, *Theory of Probability: A critical introductory treatment*, Chapter 2



Although representing model uncertainty in deep learning is of crucial importance -- think about medical applications or self-driving cars -- standard DNNs do not provide such information.

The $\pm$ symbol denoting the confidence interval of a prediction is rare in deep learning papers, even though the prediction of a DNN on a test sample is anything but certain. Uncertainty does not originate only from intrinsically stochastic processes (such as radioactive decay or the roll of a die), but *also from a lack of knowledge*, as when we bet on something outside our ground truth, such as a test sample.

Notice that the probabilistic interpretation of a softmax output does not solve the problem: **a model can be uncertain in its prediction even with a softmax output close to 1**, as we will see.

Today we will explore a very simple idea to model uncertainty in deep learning through dropout, following the works of Gal and Ghahramani:
- [*Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning*](https://arxiv.org/abs/1506.02142) 
- [*Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference*](https://arxiv.org/abs/1506.02158)

#### Dropout and Batch Normalization: two common regularization methods

As seen in lecture, regularizers are general methods to reduce overfitting and thus improve generalization.

Regularization methods are based on general considerations about the learning algorithm; their ultimate objective is to reduce the effective number of free parameters of the model.

Today we will experiment with:
- **Dropout**: training an ensemble of neural networks, where each model is obtained by dropping random units from a single large network.
- **Batch Normalization**: normalizing the activations of hidden layers as we do with the input data, while learnable scale and shift parameters let each layer easily recover the identity transformation.
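Before using them in a real network, it may help to see what these two layers do to a tensor. The following is a minimal sketch on made-up data (the tensor shapes and the values of `x` and `h` are assumptions chosen only for illustration): `nn.Dropout` zeroes random entries and rescales the survivors in train mode and is the identity in eval mode, while `nn.BatchNorm1d` in train mode standardizes each feature using the statistics of the current mini-batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

x = torch.ones(4, 8)

# Dropout: active in train mode, identity in eval mode.
dropout = nn.Dropout(p=0.5)
dropout.train()
y_train = dropout(x)  # each entry is either 0 or 1 / (1 - p) = 2
dropout.eval()
y_eval = dropout(x)   # the input is returned unchanged

# Batch normalization: in train mode each feature is standardized
# with the mean and variance of the current mini-batch.
bn = nn.BatchNorm1d(8)
bn.train()
h = torch.randn(32, 8) * 5.0 + 3.0  # toy activations with mean ~3 and std ~5
h_norm = bn(h)

print(y_train)
print(torch.equal(y_eval, x))
print(h_norm.mean(dim=0))                 # close to 0 for every feature
print(h_norm.std(dim=0, unbiased=False))  # close to 1 for every feature
```

Note that both layers behave differently in train and eval mode, which is why the calls to `net.train()` and `net.eval()` matter in the rest of this notebook.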

### Training a bunch of models on CIFAR10

Today we will train several models on CIFAR10, experimenting with the effects of regularization and trying to say something about the uncertainty of our predictions.

Let's download and normalize the dataset...

In [0]:
train_transform = transforms.Compose(
    [
     transforms.ToTensor(),
     transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))])

test_transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=train_transform)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=test_transform)


Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz



Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified


...and then prepare the dataloaders.

In [0]:
batch_size = 32
classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
print('train set size: {}'.format(len(trainset)))
print('test set size: {}'.format(len(testset)))


trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                          shuffle=True, num_workers=4)

testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                         shuffle=False, num_workers=4)


train set size: 50000
test set size: 10000


As always we want to visualize some samples. Make your own prediction on each one.

In [0]:
# @title Visualize samples function

def visualize_samples(inputs, title=None):
    """
    Visualization of transformed samples, a standard call:
        inputs, labels = next(iter(trainloader))
        visualize_samples(inputs)
    Arguments:
    inputs -- a batch from the dataloader (a PyTorch tensor of shape (batch_size, 3, 32, 32))
              or a list of image tensors, each of shape (3, 32, 32)

    Return:
    None (A nice plot)
    """
    
    # Make a grid from batch
    inp = torchvision.utils.make_grid(inputs, nrow=12)

    inp = inp.numpy().transpose((1, 2, 0))
    # Undo the normalization applied by the dataset transforms (CIFAR10 statistics)
    mean = np.array([0.4914, 0.4822, 0.4465])
    std = np.array([0.2023, 0.1994, 0.2010])
    inp = std * inp + mean
    inp = np.clip(inp, 0, 1)  # plotly accepts the colour information both in the 0-1 range and in the 0-255 range
    fig = px.imshow(inp, title=title)
    fig.show()


# Get a batch of training data
inputs = [trainset[i][0] for i in range(4)]
class_idx = [trainset[i][1] for i in range(4)]


visualize_samples(inputs, title=f'Make your prediction, which is the label of each image? The labels are<br> {[x for x in classes]}')  

# Solution
# print(f'Ground truth: {[classes[x] for x in class_idx]}')




> **EXERCISE** Be aware of your own uncertainty about these predictions: how much would you bet on your guess for each image?

For the purpose of our experiments we will work with a very simple CNN architecture, similar to the famous LeNet of 1998 by Yann LeCun, from a time when neither dropout nor batch normalization existed.

We will try 4 + 1 different models:
- LeNet without dropout or batch normalization (`VanillaLeNet`).
- LeNet with standard dropout after each hidden fully connected layer (`StdDropoutLeNet`).
- LeNet with dropout2d (zeroing entire channels) after each convolutional layer and standard dropout after each hidden fully connected layer (`FullDropoutLeNet`).
- LeNet with batch normalization after each layer except the output one (`BatchNormLeNet`).

To be less boring than usual, today we will use **SELUs** (scaled exponential linear units), a rather exotic non-linearity from the crowded [zoo of activation functions](https://pytorch.org/docs/stable/nn.functional.html#non-linear-activation-functions). If ReLU is the lion, SELU could be a platypus; spot its definition in the PyTorch documentation.

<img src="https://storage.googleapis.com/groundai-web-prod/media%2Fusers%2Fuser_14%2Fproject_393006%2Fimages%2Fselu.png" alt="drawing" width="400"/>

SELU was proposed in [*Self Normalizing Neural Networks*](https://arxiv.org/abs/1706.02515) and is supported by a mammoth experimental study on more than 100 machine learning tasks and a 90-page appendix full of calculations. The main point of SELUs is to induce self-normalizing properties without the need for batch normalization. This is of special interest for FNNs (fully connected networks) with many layers, which suffer the most from the perturbations induced by batch normalization.
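As a quick check of the definition, the sketch below implements SELU by hand and compares it with `F.selu`. The helper `manual_selu` is a hypothetical name introduced here; the two constants are the ones reported in the paper and in the PyTorch documentation. The second part is a toy experiment under stated assumptions (random weights with zero mean and variance 1 / fan-in, standard normal inputs, no training) that illustrates what "self-normalizing" means: the activations keep roughly zero mean and unit standard deviation after many layers.

```python
import torch
import torch.nn.functional as F

# Constants from the SELU paper (also listed in the PyTorch documentation)
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805


def manual_selu(x: torch.Tensor) -> torch.Tensor:
    """selu(x) = scale * (max(0, x) + min(0, alpha * (exp(x) - 1)))"""
    positive = torch.clamp(x, min=0)
    negative = torch.clamp(ALPHA * (torch.exp(x) - 1), max=0)
    return SCALE * (positive + negative)


x = torch.linspace(-3, 3, steps=7)
print(torch.allclose(manual_selu(x), F.selu(x)))

# Toy self-normalization experiment (assumed setup, no training involved)
torch.manual_seed(0)
width, depth = 512, 20
h = torch.randn(1024, width)
for _ in range(depth):
    w = torch.randn(width, width) / width ** 0.5  # zero mean, variance 1 / fan_in
    h = F.selu(h @ w)
print(h.mean().item(), h.std().item())  # expected to stay near 0 and 1
```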





In [0]:
# We will perform the final softmax outside the network classes, to be ready for a future experiment

# Lenet without dropout 
class VanillaLeNet(nn.Module):
    def __init__(self):
        super(VanillaLeNet, self).__init__()
        self.conv1 = nn.Conv2d(3, 192, 5, padding=2)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(192, 192, 5, padding=2)
        self.fc1 = nn.Linear(192 * 8 * 8, 1024)
        self.fc2 = nn.Linear(1024, 256)
        self.fc3 = nn.Linear(256, 10)

    def forward(self, x):
        x = F.selu(self.pool(self.conv1(x)))
        x = F.selu(self.pool(self.conv2(x)))
        x = x.view(-1, 192 * 8 * 8)
        x = F.selu(self.fc1(x))
        x = F.selu(self.fc2(x))
        x = self.fc3(x)
        return x


# Lenet with dropout after fully connected layers
class StdDropoutLeNet(nn.Module):
    def __init__(self):
        super(StdDropoutLeNet, self).__init__()
        self.conv1 = nn.Conv2d(3, 192, 5, padding=2)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(192, 192, 5, padding=2)
        self.fc1 = nn.Linear(192 * 8 * 8, 1024)
        self.fc2 = nn.Linear(1024, 256)
        self.fc3 = nn.Linear(256, 10)
        self.dropout = nn.Dropout(p=0.3)

    def forward(self, x):
        x = F.selu(self.pool(self.conv1(x)))
        x = F.selu(self.pool(self.conv2(x)))
        x = x.view(-1, 192 * 8 * 8)
        x = F.selu(self.fc1(x))
        x = F.selu(self.fc2(self.dropout(x)))
        x = self.fc3(self.dropout(x))
        return x
    

# Lenet with dropout also after convolutional layers 
class FullDropoutLeNet(nn.Module):
    def __init__(self):
        super(FullDropoutLeNet, self).__init__()
        self.conv1 = nn.Conv2d(3, 192, 5, padding=2)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(192, 192, 5, padding=2)
        self.fc1 = nn.Linear(192 * 8 * 8, 1024)
        self.fc2 = nn.Linear(1024, 256)
        self.fc3 = nn.Linear(256, 10)
        self.dropout = nn.Dropout(p=0.3)
        self.dropout2d = nn.Dropout2d(p=0.3)  

    def forward(self, x):
        x = F.selu(self.pool(self.dropout2d(self.conv1(x)))) # dropout 2D!
        x = F.selu(self.pool(self.dropout2d(self.conv2(x))))
        x = x.view(-1, 192 * 8 * 8)
        x = F.selu(self.fc1(x))
        x = F.selu(self.fc2(self.dropout(x)))
        x = self.fc3(self.dropout(x))
        return x


# Lenet with batch normalization 
class BatchNormLeNet(nn.Module):
    def __init__(self):
        super(BatchNormLeNet, self).__init__()
        self.conv1 = nn.Conv2d(3, 192, 5, padding=2)
        self.bn1 = nn.BatchNorm2d(192)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(192, 192, 5, padding=2)
        self.bn2 = nn.BatchNorm2d(192)
        self.fc1 = nn.Linear(192 * 8 * 8, 1024)
        self.bnf1 = nn.BatchNorm1d(1024)
        self.fc2 = nn.Linear(1024, 256)
        self.bnf2 = nn.BatchNorm1d(256)
        self.fc3 = nn.Linear(256, 10)

    def forward(self, x):
        # Since the purpose of SELU is to make batchnorm unnecessary, in this architecture we use the standard ReLU
        x = F.relu(self.pool(self.bn1(self.conv1(x))))  
        x = F.relu(self.pool(self.bn2(self.conv2(x))))
        x = x.view(-1, 192 * 8 * 8)
        x = F.relu(self.bnf1(self.fc1(x)))
        x = F.relu(self.bnf2(self.fc2(x)))
        x = self.fc3(x)
        return x

And the + 1?

Here we are:

- LeNet with *Monte Carlo dropout* (`MonteCarloDropoutLeNet`): it has the very same architecture and training procedure as `FullDropoutLeNet`, but it behaves differently at test time. Instead of switching dropout off and using the full network (with activations rescaled so that their expected value matches the one seen during training), we keep masking out neurons during evaluation, collect several predictions for the same sample and take their average; i.e. the same test sample is given to *different models of the dropout ensemble*. This will also allow us to reason about the uncertainty of the prediction.
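The idea can be sketched in a few lines. The snippet below is only an illustration on a hypothetical toy classifier (`toy_net`, the helper `mc_dropout_predict` and the input sizes are assumptions, not part of this tutorial's models): dropout is kept active by leaving the network in train mode, and the softmax outputs of several stochastic forward passes are averaged.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical toy classifier, used only for illustration
toy_net = nn.Sequential(
    nn.Linear(16, 64), nn.SELU(), nn.Dropout(p=0.3), nn.Linear(64, 10)
)


def mc_dropout_predict(net, inputs, n_samples=20):
    """Average the softmax outputs of n_samples stochastic forward passes."""
    net.train()  # keep dropout active at prediction time
    with torch.no_grad():
        probs = torch.stack(
            [F.softmax(net(inputs), dim=1) for _ in range(n_samples)]
        )  # shape: (n_samples, batch_size, n_classes)
    # The mean is the ensemble prediction, the std hints at its uncertainty
    return probs.mean(dim=0), probs.std(dim=0)


x = torch.randn(5, 16)
mean_probs, std_probs = mc_dropout_predict(toy_net, x)
print(mean_probs.shape)       # one probability vector per input sample
print(mean_probs.sum(dim=1))  # each averaged prediction still sums to 1
```

Keep in mind that `net.train()` also switches any batch normalization layer to its training behaviour, so this shortcut is only safe for networks without batch normalization, as is the case for `FullDropoutLeNet`.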

Let's define two functions to wrap up the training and test pipelines. The test function must take into account the different evaluation procedure of `MonteCarloDropoutLeNet`.

In [0]:
# We want to print the training loss every log_freq batches
log_freq = len(trainset)//batch_size  # default 

def train(epoch, net, optimizer, loss_func, log_freq=log_freq):
    running_loss = 0.0  # loss accumulated since the last log
    epoch_loss = 0.0  # loss accumulated over the whole epoch
    for i, data in enumerate(trainloader, start=1):
        # get the inputs
        inputs, labels = data
        inputs, labels = inputs.to(device), labels.to(device)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = loss_func(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        epoch_loss += loss.item()
        if i % log_freq == 0:    # print every log_freq mini-batches
            print('[Epoch : %d, Iter: %5d] loss: %.3f' %
                  (epoch + 1, i, running_loss / log_freq))
            running_loss = 0.0
    # average loss per mini-batch over the whole epoch
    # (running_loss is reset after each log, so it cannot be used here)
    return epoch_loss / len(trainloader)

def test(net, is_MCDO=False, train_data=False):
    if train_data:
        print("Accuracy on training data")
        dataloader = trainloader
    else:
        print("Accuracy on test data")
        dataloader = testloader
    class_correct = list(0. for i in range(10))
    class_total = list(0. for i in range(10))
    with torch.no_grad():
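        for batch_idx, data in enumerate(dataloader):
            # Evaluate at most len(testloader) batches, so that the accuracy on the
            # training data is estimated on a subset as large as the test set
            if batch_idx == len(testloader):
                break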
            inputs, labels = data
            inputs, labels = inputs.to(device), labels.to(device)
            output = 0
            if not is_MCDO:
                output = net(inputs)
            else:
                for i in range(20):  # number of different predictions
                    output += F.softmax(net(inputs), dim=1) / 20.
                output = torch.log(output)
            _, predicted = torch.max(output, 1)
            c = (predicted == labels).squeeze()
            for i in range(len(labels)):
                label = labels[i]
                class_correct[label] += c[i].item()
                class_total[label] += 1


    for i in range(10):
        print('Accuracy of %5s : %.2f %%' % (
            classes[i], 100 * class_correct[i] / class_total[i]))

    test_score = np.mean([100 * class_correct[i] / class_total[i] for i in range(10)])
    print(test_score)
    return test_score

> ⚠️ **WARNING** The following cell starts the training of the models, which takes about 1 hour and 30 minutes. You can skip it without training: the next section loads a pretrained model.

Each model is saved in `SAVE_PATH` at the end of its training. If you want to keep the trained models beyond a *Colab Runtime disconnected* event, you can mount your Google Drive and set `SAVE_PATH` to a folder inside your Drive. To mount your Drive, open the Files menu on the left (folder icon).

> **FUTURE EXERCISE** (After having read the second part of this tutorial about visualization toolkits) Add the code needed to monitor this training using a visualization toolkit of your choice. Are the weights of the dropout models closer to zero?  


In [0]:
from tqdm.notebook import tqdm

run_training = True  #@param {type:"boolean"}
SAVE_PATH = '/content/'  #@param {type:"string"}

# the architecture of MonteCarloDropoutLeNet is the same as that of FullDropoutLeNet
lenets = [FullDropoutLeNet, StdDropoutLeNet, VanillaLeNet, BatchNormLeNet]

epoch_num = 75
test_freq = 5
losses = []
net_scores = {lenet.__name__ : [] for lenet in lenets}
net_scores['MonteCarloDropoutLeNet'] = []
net_tr_scores = {lenet.__name__ : [] for lenet in lenets}
net_tr_scores['MonteCarloDropoutLeNet'] = []   
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

if run_training:
    for lenet in lenets:
        print(lenet.__name__, 'training')
        net = lenet()
        net.to(device)
        
        learning_rate = 5e-4
        loss_func = nn.CrossEntropyLoss()
        optimizer = optim.Adam(net.parameters(), lr=learning_rate, weight_decay=0.0005, amsgrad=True)
        scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.75)
        
        for i in tqdm(range(epoch_num)):
            net.train()
            loss_avg = train(epoch=i, net=net, optimizer=optimizer, loss_func=loss_func)
            losses.append(loss_avg)
            scheduler.step()

            if (i+1) % test_freq == 0:
                if lenet.__name__ == 'FullDropoutLeNet':
                    print('FullDropoutLeNet TEST')
                    net.eval()
                    net_score = test(net)
                    net_scores['FullDropoutLeNet'].append(net_score)
                    net_tr_score = test(net, train_data=True)
                    net_tr_scores['FullDropoutLeNet'].append(net_tr_score)
                    print('MonteCarloDropoutLeNet TEST')
                    net.train()
                    net_score = test(net,is_MCDO=True)
                    net_scores['MonteCarloDropoutLeNet'].append(net_score)
                    net_tr_score = test(net, is_MCDO=True, train_data=True)
                    net_tr_scores['MonteCarloDropoutLeNet'].append(net_tr_score)
                else:
                    net.eval()
                    net_score = test(net)
                    net_scores[lenet.__name__].append(net_score)
                    net_tr_score = test(net, train_data=True)
                    net_tr_scores[lenet.__name__].append(net_tr_score)
        torch.save(net.state_dict(), SAVE_PATH + lenet.__name__ + '.pt')
                    
                

FullDropoutLeNet training



[Epoch : 1, Iter:  1562] loss: 1.509
[Epoch : 2, Iter:  1562] loss: 1.129
[Epoch : 3, Iter:  1562] loss: 1.001
[Epoch : 4, Iter:  1562] loss: 0.928
[Epoch : 5, Iter:  1562] loss: 0.861
FullDropoutLeNet TEST
Accuracy on test data
Accuracy of plane : 77.40 %
Accuracy of   car : 75.00 %
Accuracy of  bird : 53.70 %
Accuracy of   cat : 49.80 %
Accuracy of  deer : 70.30 %
Accuracy of   dog : 54.60 %
Accuracy of  frog : 86.00 %
Accuracy of horse : 75.20 %
Accuracy of  ship : 79.10 %
Accuracy of truck : 79.90 %
70.1
Accuracy on training data
Accuracy of plane : 82.22 %
Accuracy of   car : 84.11 %
Accuracy of  bird : 64.15 %
Accuracy of   cat : 61.85 %
Accuracy of  deer : 78.38 %
Accuracy of   dog : 61.62 %
Accuracy of  frog : 88.80 %
Accuracy of horse : 84.77 %
Accuracy of  ship : 84.65 %
Accuracy of truck : 88.42 %
77.89582420664777
MonteCarloDropoutLeNet TEST
Accuracy on test data
Accuracy of plane : 78.10 %
Accuracy of   car : 78.70 %
Accuracy of  bird : 52.30 %
Accuracy of   cat : 53.90 %


HBox(children=(IntProgress(value=0, max=75), HTML(value='')))

[Epoch : 1, Iter:  1562] loss: 1.405
[Epoch : 2, Iter:  1562] loss: 1.009
[Epoch : 3, Iter:  1562] loss: 0.866
[Epoch : 4, Iter:  1562] loss: 0.776
[Epoch : 5, Iter:  1562] loss: 0.699
Accuracy on test data
Accuracy of plane : 83.50 %
Accuracy of   car : 84.80 %
Accuracy of  bird : 57.60 %
Accuracy of   cat : 55.90 %
Accuracy of  deer : 81.10 %
Accuracy of   dog : 54.60 %
Accuracy of  frog : 75.90 %
Accuracy of horse : 72.20 %
Accuracy of  ship : 80.80 %
Accuracy of truck : 83.50 %
72.99
Accuracy on training data
Accuracy of plane : 90.31 %
Accuracy of   car : 94.19 %
Accuracy of  bird : 73.74 %
Accuracy of   cat : 71.48 %
Accuracy of  deer : 91.57 %
Accuracy of   dog : 70.26 %
Accuracy of  frog : 83.32 %
Accuracy of horse : 83.60 %
Accuracy of  ship : 88.61 %
Accuracy of truck : 91.78 %
83.88636866347333
[Epoch : 6, Iter:  1562] loss: 0.630
[Epoch : 7, Iter:  1562] loss: 0.572
[Epoch : 8, Iter:  1562] loss: 0.524
[Epoch : 9, Iter:  1562] loss: 0.488
[Epoch : 10, Iter:  1562] loss: 0.4

HBox(children=(IntProgress(value=0, max=75), HTML(value='')))

[Epoch : 1, Iter:  1562] loss: 1.280
[Epoch : 2, Iter:  1562] loss: 0.905
[Epoch : 3, Iter:  1562] loss: 0.741
[Epoch : 4, Iter:  1562] loss: 0.612
[Epoch : 5, Iter:  1562] loss: 0.513
Accuracy on test data
Accuracy of plane : 79.00 %
Accuracy of   car : 80.70 %
Accuracy of  bird : 63.20 %
Accuracy of   cat : 45.50 %
Accuracy of  deer : 61.50 %
Accuracy of   dog : 80.50 %
Accuracy of  frog : 76.90 %
Accuracy of horse : 74.00 %
Accuracy of  ship : 80.90 %
Accuracy of truck : 76.40 %
71.85999999999999
Accuracy on training data
Accuracy of plane : 89.89 %
Accuracy of   car : 93.51 %
Accuracy of  bird : 85.02 %
Accuracy of   cat : 67.96 %
Accuracy of  deer : 81.49 %
Accuracy of   dog : 91.98 %
Accuracy of  frog : 88.33 %
Accuracy of horse : 88.12 %
Accuracy of  ship : 92.34 %
Accuracy of truck : 92.09 %
87.07179335021272
[Epoch : 6, Iter:  1562] loss: 0.416
[Epoch : 7, Iter:  1562] loss: 0.349
[Epoch : 8, Iter:  1562] loss: 0.291
[Epoch : 9, Iter:  1562] loss: 0.259
[Epoch : 10, Iter:  156

HBox(children=(IntProgress(value=0, max=75), HTML(value='')))

[Epoch : 1, Iter:  1562] loss: 1.163
[Epoch : 2, Iter:  1562] loss: 0.885
[Epoch : 3, Iter:  1562] loss: 0.783
[Epoch : 4, Iter:  1562] loss: 0.703
[Epoch : 5, Iter:  1562] loss: 0.634
Accuracy on test data
Accuracy of plane : 70.00 %
Accuracy of   car : 85.30 %
Accuracy of  bird : 65.10 %
Accuracy of   cat : 54.70 %
Accuracy of  deer : 75.00 %
Accuracy of   dog : 72.00 %
Accuracy of  frog : 83.60 %
Accuracy of horse : 69.30 %
Accuracy of  ship : 90.80 %
Accuracy of truck : 88.40 %
75.41999999999999
Accuracy on training data
Accuracy of plane : 79.54 %
Accuracy of   car : 91.86 %
Accuracy of  bird : 79.37 %
Accuracy of   cat : 69.66 %
Accuracy of  deer : 86.11 %
Accuracy of   dog : 81.05 %
Accuracy of  frog : 87.93 %
Accuracy of horse : 81.70 %
Accuracy of  ship : 96.26 %
Accuracy of truck : 92.76 %
84.62253320549279
[Epoch : 6, Iter:  1562] loss: 0.560
[Epoch : 7, Iter:  1562] loss: 0.490
[Epoch : 8, Iter:  1562] loss: 0.418
[Epoch : 9, Iter:  1562] loss: 0.369
[Epoch : 10, Iter:  156

Finally we can look at the trends of the accuracy on seen and unseen data during training. What considerations can you make looking at the plot? What can be the causes of these trends?


In [0]:
#@title Accuracy across epochs plot
fig = go.Figure()
for lenet_name, lenet_score in net_scores.items():
    x = np.arange(len(lenet_score))
    fig.add_trace(go.Scatter(x=x, y=lenet_score, mode='lines', name=lenet_name + ' test'))
for lenet_name, lenet_score in net_tr_scores.items():
    x = np.arange(len(lenet_score))
    fig.add_trace(go.Scatter(x=x, y=lenet_score, mode='lines', name=lenet_name + ' train'))

fig.update_layout(
        title='Accuracy across epochs',
        xaxis_title="epoch / test frequency",
        yaxis_title="Accuracy")

fig.show()
print(f'test frequency: {test_freq}')

test frequency: 5


**SPOILER** (make your own considerations before reading on)

Before any comparison between models, we notice strong memorization: not only is the accuracy on training data much higher than the one on test data, but reaching 100% on the training data is suspicious in itself. In all likelihood the models have memorized the labels of some training samples. After all, the training set is quite small for this kind of problem.

Concerning the two regularizers we are working on, both of them improved the results of the vanilla architecture, with batchnorm performing slightly better than dropout.

> **EXERCISE** These results raise a bunch of questions. Can we mitigate the memorization effect by augmenting the training dataset with some transformations? What happens if we raise the dropout coefficient? And if we also apply dropout to the input layer? What is the performance of a model with both batch normalization and dropout, and in which order should you place them? Does the performance of `MonteCarloDropoutLeNet` increase if we raise the number of predictions from 20 to 50 or 100?
>
> Answer one or more of these questions, or new ones, and send us your findings by email, accompanied by a brief discussion.

### Experimenting with uncertainty

In this section we will see first-hand the ensemble behind a single neural net trained with dropout, looking closer at the predictions of each model in the ensemble.

First of all we load a pretrained `MonteCarloDropoutLeNet`.

In [0]:
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1-0HX-AriuL8JOmknPHPvWgjdFfDhmSCx' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1-0HX-AriuL8JOmknPHPvWgjdFfDhmSCx" -O MonteCarloDropoutLeNet.pt && rm -rf /tmp/cookies.txt

--2020-05-01 13:06:05--  https://docs.google.com/uc?export=download&confirm=&id=1-0HX-AriuL8JOmknPHPvWgjdFfDhmSCx
Resolving docs.google.com (docs.google.com)... 74.125.142.101, 74.125.142.139, 74.125.142.138, ...
Connecting to docs.google.com (docs.google.com)|74.125.142.101|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-00-6o-docs.googleusercontent.com/docs/securesc/vnrq3a06c55r651ufee9913usk1k44ba/3o0klo73p48vj45191d4ra4qtebkoatm/1588338300000/02704480380348180023/07046652426960477723Z/1-0HX-AriuL8JOmknPHPvWgjdFfDhmSCx?e=download [following]
--2020-05-01 13:06:08--  https://doc-00-6o-docs.googleusercontent.com/docs/securesc/vnrq3a06c55r651ufee9913usk1k44ba/3o0klo73p48vj45191d4ra4qtebkoatm/1588338300000/02704480380348180023/07046652426960477723Z/1-0HX-AriuL8JOmknPHPvWgjdFfDhmSCx?e=download
Resolving doc-00-6o-docs.googleusercontent.com (doc-00-6o-docs.googleusercontent.com)... 74.125.195.132, 2607:f8b0:400e:c09::84
Connecting to 

In [0]:
model = FullDropoutLeNet()
# map_location lets the checkpoint load also when no GPU is available
model.load_state_dict(torch.load('/content/MonteCarloDropoutLeNet.pt', map_location=device))
model = model.to(device)

In [0]:
model.train()
test(model, is_MCDO=True)

Accuracy on test data
Accuracy of plane : 81.00 %
Accuracy of   car : 88.90 %
Accuracy of  bird : 69.50 %
Accuracy of   cat : 56.70 %
Accuracy of  deer : 71.60 %
Accuracy of   dog : 71.60 %
Accuracy of  frog : 86.80 %
Accuracy of horse : 83.20 %
Accuracy of  ship : 87.30 %
Accuracy of truck : 84.80 %
78.13999999999999


78.13999999999999

Since we want to test the uncertainty of the predictions of our model, we want to craft a difficult classification task.  
Let's find two visually similar samples with different labels in the test set.

In [0]:
inputs = [testset[i][0] for i in range(36)]
class_idx = [testset[i][1] for i in range(36)]

visualize_samples(inputs)  

What about the dog and the horse at the beginning of the second row?

If we flip the horse horizontally they look even more alike. Let's prepare a smooth transition between the two images in 12 steps; we are going to analyze the predictions of our model on each of them.

In [0]:
inputs_or = [testset[12][0], torch.flip(testset[13][0], dims=[2])]  #flip on dimension 2 to flip horizontally
inputs = [inputs_or[0] * (1 - i) + inputs_or[1] * i for i in np.linspace(0,1,num=12)]
visualize_samples(inputs)

Let's start by looking at the labels predicted on each of these images by `MonteCarloDropoutLeNet`, averaging over 100 predictions.

Monte Carlo Dropout requires several predictions for the same sample, making test time more computationally expensive and time consuming. Nevertheless we have GPUs, and we can place copies of the same sample in a batch. As long as the batch fits in the GPU memory, we can make these multiple predictions without slowdowns.

> **EXERCISE** Parallelize the following code removing the second `for`. Can you parallelize it even more removing also the first `for`?

In [0]:
model.train()
with torch.no_grad():
    for i, sample in enumerate(inputs):
        sample = sample.to(device)
        output = torch.zeros(10, device=device)

        for j in range(100):
            single_output = model(sample.unsqueeze(0))
            # average the softmax probabilities, as done in `test` and in the solution below
            output += F.softmax(single_output[0], dim=0) / 100.

        class_index = torch.argmax(output).item()
        print(i, classes[class_index])


0 dog
1 dog
2 dog
3 dog
4 dog
5 dog
6 horse
7 horse
8 horse
9 horse
10 horse
11 horse


In [0]:
# @title Solution 👀 
model.train()
with torch.no_grad():
    for i, sample in enumerate(inputs):
        sample = sample.to(device)
        sample = sample.repeat(100, 1, 1, 1)
        outputs = model(sample)
        output = F.softmax(outputs, dim=1).mean(dim=0)  # average over the 100 forward passes
        class_index = torch.argmax(output).item()
        print(i, classes[class_index])

0 dog
1 dog
2 dog
3 dog
4 dog
5 dog
6 horse
7 horse
8 horse
9 horse
10 horse
11 horse


Finally we want to look at the distribution, over the different forward passes, of the activations of the last-layer neurons corresponding to the dog and horse classes, both before and after the softmax.

In [0]:
model.train()
all_outputs = []
all_soft_outputs = []
with torch.no_grad():
    for sample in inputs:
        sample = sample.to(device)
        sample = sample.repeat(100, 1, 1, 1)
        outputs = model(sample)
        soft_outputs = F.softmax(outputs, dim=1)
        all_outputs.append(outputs.to('cpu').numpy())
        all_soft_outputs.append(soft_outputs.to('cpu').numpy())

Let's make a nice plot

In [0]:
visualize_samples(inputs)
for k, output_sequence in enumerate([all_outputs, all_soft_outputs]):
    if k == 0:
        title = 'outputs of the neurons corresponding to dog and horse of the last layer before softmax (100 forward passes with dropout)' 
    else:
        title = 'outputs of the neurons corresponding to dog and horse of the last layer after softmax (100 forward passes with dropout)' 
    fig = go.Figure()
    ndfp = output_sequence[0].shape[0]  # number_of_different_forward_passes
    x_dogs = np.zeros(len(output_sequence) * ndfp)
    y_dogs = np.zeros(len(output_sequence) * ndfp)
    x_horses = np.zeros(len(output_sequence) * ndfp)
    y_horses = np.zeros(len(output_sequence) * ndfp)
    for i, output in enumerate(output_sequence):
        x_dogs[i * ndfp: (i+1) * ndfp] += i
        y_dogs[i * ndfp: (i+1) * ndfp] = output[:,5]
        x_horses[i * ndfp: (i+1) * ndfp] += i
        y_horses[i * ndfp: (i+1) * ndfp] = output[:,7]

    fig.add_trace(go.Scatter(x=x_dogs, y=y_dogs,
                        mode='markers',
                        name='dogs',
                        marker=dict(
                            size=50,
                            opacity=0.1,
                            symbol='line-ew',
                            line=dict(width=6, color='deepskyblue'))))
    fig.add_trace(go.Scatter(x=x_horses, y=y_horses,
                        mode='markers',
                        name='horses',
                        marker=dict(
                            size=50,
                            opacity=0.1,
                            symbol='line-ew',
                            line=dict(width=6, color='salmon'))))
    fig.update_layout(
        title=title,
        xaxis_title="image",
        yaxis_title="neuron activation",
        xaxis_type='category')

    fig.show()


First, let's look at the results *before* the softmax on the central images, the more ambiguous ones. As expected, the activations of the dog and horse neurons are quite close: the distributions of these activations over the ensemble of models overlap. This is unlike the first and last images, where the two distributions are well separated.

Now look at the activations *after* the softmax. Despite the higher ambiguity of the central images, several models in the ensemble output the maximum activation of 1. It should now be clear that we cannot attribute 100% confidence to a classification just because its softmax output is close to one.

The softmax output of a class can get arbitrarily close to 1 if its mean pre-softmax activation is far enough above the means of the other classes.
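
A small numerical sketch can make this concrete. The logit values below are made up for illustration (they are not outputs of the model above), and `softmax` is a plain-Python helper written for this example:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Two hypothetical (dog, horse) logit pairs with the same gap of 10.
p_small = softmax([10.0, 0.0])
p_large = softmax([110.0, 100.0])

# Softmax only sees the difference between logits, so both pairs give the
# same output, and a gap of 10 already pushes the winning class above 0.9999.
print(p_small[0], p_large[0])
```

The output says nothing about how stable that gap is across different dropout masks, which is why a single value close to one is not a reliable confidence estimate.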

Instead, we can evaluate the **uncertainty** of our predictions using Monte Carlo dropout and looking at the **overlap of the distributions** of activations before the softmax.
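
One possible way to turn the overlap into a number is to count how often the ranking of the two classes flips across the stochastic forward passes. This is only a sketch: `disagreement_rate` is a hypothetical helper, and the activations are synthetic Gaussian samples standing in for the 100 dropout passes, not real model outputs.

```python
import random

def disagreement_rate(logits_a, logits_b):
    """Fraction of stochastic forward passes in which class b beats class a."""
    flips = sum(1 for a, b in zip(logits_a, logits_b) if b > a)
    return flips / len(logits_a)

rng = random.Random(0)

# Synthetic stand-ins for 100 dropout forward passes (assumed values).
clear_dog = [rng.gauss(6.0, 1.0) for _ in range(100)]
clear_horse = [rng.gauss(0.0, 1.0) for _ in range(100)]
ambiguous_dog = [rng.gauss(3.0, 1.0) for _ in range(100)]
ambiguous_horse = [rng.gauss(3.0, 1.0) for _ in range(100)]

print("clear image:    ", disagreement_rate(clear_dog, clear_horse))
print("ambiguous image:", disagreement_rate(ambiguous_dog, ambiguous_horse))
```

A rate near 0 means the two distributions are well separated; a rate near 0.5 means they overlap almost completely.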

> **EXERCISE** Look at the distribution of activations on new images (even a combination of more than two, or maybe now taking two very different images, or instead very close, or with the same label, or...). Send us by email your plots and a brief discussion about your findings.

# The Deep Learning Toolset

A growing number of great tools facilitate the deep learning process, making it both more accessible and more efficient.
Keep in mind that, alongside DL-specific tools, we still need the general-purpose tools and best practices common to every software project.

A catalog of available machine learning tools can be found [here](https://github.com/josephmisiti/awesome-machine-learning). The following landscape helps to understand which problem each tool tries to solve:

[![](https://raw.githubusercontent.com/lucmos/DLAI-s2-2020-tutorials/master/07/pics/tools.jpg)](https://medium.com/luminovo/the-deep-learning-toolset-an-overview-b71756016c06)

The number of available tools is overwhelming.

In the following sections we will briefly explore some of the general-purpose and DL-specific tools for AI researchers (some of which are not suitable when using Jupyter notebooks or Google Colaboratory).

## Model Versioning

Several tools exist to monitor training, organize experiments, and log parameters and metrics. Especially useful is the ability to compare parameters or metrics across different experiments.

[TensorBoard](https://www.tensorflow.org/tensorboard), [MLFlow](https://mlflow.org/) tracking and [Weights&Biases](https://www.wandb.com) are three examples of tools that allow the tracking of model experiments.
MLFlow is an open source framework; Weights&Biases is a commercial service (free for academics) with the option to use their cloud storage; TensorBoard is more a visualization toolkit than a model versioning system.

### Weights & Biases


In this section we will see how to use Weights&Biases. We chose it over MLFlow since it is easier to set up in a Colab environment (MLFlow requires you to start a server somewhere).

#### Setup

We must install and import W&B. You also need to log in to your W&B account, which you should create if you don't have one yet.

In [0]:
! pip install wandb


In [0]:
import wandb

In [0]:
# WandB – Login to your wandb account so you can log all your metrics
!wandb login

wandb: You can find your API key in your browser here: https://app.wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter: <your-api-key>
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
Successfully logged in to Weights & Biases!


#### Define the Neural Network

In [0]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        
        # In our constructor, we define our neural network architecture that we'll use in the forward pass.
        # Conv2d() adds a convolution layer that generates 2 dimensional feature maps to learn different aspects of our image
        self.conv1 = nn.Conv2d(3, 6, kernel_size=5)
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5)
        
        # Linear(x, y) creates dense, fully connected layers with x inputs and y outputs
        # Linear layers output the product of the inputs with the weight matrix, plus a bias.
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Here we feed the feature maps from the convolutional layers into a max_pool2d layer.
        # The max_pool2d layer reduces the size of the image representation our convolutional layers learnt,
        # and in doing so it reduces the number of parameters and computations the network needs to perform.
        # Finally we apply the relu activation function which gives us max(0, max_pool2d_output)
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2(x), 2))
        
        # Reshapes x into size (-1, 16 * 5 * 5) so we can feed the convolution layer outputs into our fully connected layer
        x = x.view(-1, 16 * 5 * 5)
        
        # We apply the relu activation function to the output of the first two fully connected layers
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        
        # Finally we apply log_softmax, which returns the log-probabilities of each class (0-9); this is the input F.nll_loss expects.
        return F.log_softmax(x, dim=1)
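
The `16 * 5 * 5` input size of `fc1` follows from the shapes of the layers before it. The arithmetic can be checked with a short sketch, assuming 32x32 CIFAR10 inputs, no padding and stride 1 in the convolutions (the `nn.Conv2d` defaults); `conv_out` and `pool_out` are helper names introduced here for illustration only:

```python
def conv_out(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution (no dilation)."""
    return (size + 2 * padding - kernel_size) // stride + 1

def pool_out(size, kernel_size):
    """Spatial output size of max pooling with stride equal to the kernel size."""
    return size // kernel_size

size = 32                               # CIFAR10 images are 32x32
size = pool_out(conv_out(size, 5), 2)   # conv1 + pool: 32 -> 28 -> 14
size = pool_out(conv_out(size, 5), 2)   # conv2 + pool: 14 -> 10 -> 5
flat_features = 16 * size * size        # conv2 produces 16 channels
print(size, flat_features)
```

If you change the kernel sizes or the input resolution, the first argument of `fc1` and the `view` call in `forward` must change accordingly.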

#### Define the Training Loop

In [0]:
def train(args, model, device, train_loader, optimizer, epoch):
    # Switch model to training mode. This is necessary for layers like dropout, batchnorm etc which behave differently in training and evaluation mode
    model.train()
    
    # We loop over the data iterator, and feed the inputs to the network and adjust the weights.
    for _, (data, target) in zip(range(20), train_loader):  # Limit to the first 20 batches in this demo

        # Load the input features and labels from the training dataset
        data, target = data.to(device), target.to(device)
        
        # Reset the gradients to 0 for all learnable weight parameters
        optimizer.zero_grad()
        
        # Forward pass: Pass image data from training dataset, make predictions about class image belongs to (0-9 in this case)
        output = model(data)
        
        # Define our loss function, and compute the loss
        loss = F.nll_loss(output, target)
        
        # Backward pass: compute the gradients of the loss w.r.t. the model's parameters
        loss.backward()
        
        # Update the neural network weights
        optimizer.step()

#### Define the Evaluation Step

Here we add a line of code to:

- **wandb.log()** – Log your metrics (accuracy, loss and epoch) and examples of images along with the predicted and true labels. This allows you to visualize your neural network's performance over time.



In [0]:
def test(args, model, device, test_loader, classes):
    # Switch model to evaluation mode. This is necessary for layers like dropout, batchnorm etc which behave differently in training and evaluation mode
    model.eval()
    test_loss = 0
    correct = 0

    example_images = []
    with torch.no_grad():
        for data, target in test_loader:
            # Load the input features and labels from the test dataset
            data, target = data.to(device), target.to(device)
            
            # Make predictions: Pass image data from test dataset, make predictions about class image belongs to (0-9 in this case)
            output = model(data)
            
            # Compute the loss, summed over the samples of the batch
            test_loss += F.nll_loss(output, target, reduction='sum').item()
            
            # Get the index of the max log-probability
            pred = output.max(1, keepdim=True)[1]
            correct += pred.eq(target.view_as(pred)).sum().item()
            
            # WandB – Log images in your test dataset automatically, along with predicted and true labels by passing pytorch tensors with image data into wandb.Image
            example_images.append(wandb.Image(
                data[0], caption="Pred: {} Truth: {}".format(classes[pred[0].item()], classes[target[0]])))
    
    # WandB – wandb.log(a_dict) logs the keys and values of the dictionary passed in and associates the values with a step.
    # You can log anything by passing it to wandb.log, including histograms, custom matplotlib objects, images, video, text, tables, html, pointclouds and other 3D objects.
    # Here we use it to log test accuracy, loss and some test images (along with their true and predicted labels).
    wandb.log({
        "Examples": example_images,
        "Test Accuracy": 100. * correct / len(test_loader.dataset),
        "Test Loss": test_loss / len(test_loader.dataset)})  # average loss per test sample

#### Train, Edit, and Retrain
Run `wandb.init(project="my-project")` each time you start a new run. It automatically creates the project for you if it doesn't exist. Runs of the training script above will sync to a project named "my-project". See the [`wandb.init`](https://docs.wandb.com/library/init) documentation for more initialization options.

##### Initialize Hyperparameters

Here we add a few lines of code to:
*   **wandb.init()** – Initialize a new W&B run. Each run is a single execution of the training script.
*   **wandb.config** – Save all your hyperparameters in a config object. This lets you use W&B to sort and compare your runs by hyperparameter values.

We encourage you to tweak these and run this cell again to see if you can achieve improved model performance!

##### Track Results
*   **wandb.watch()** – Fetch all layer dimensions, gradients, model parameters and log them automatically to your dashboard.
*   **wandb.save()** – Save the model checkpoint.


Read the [documentation](https://docs.wandb.com/library) for the details.

In [0]:
# WandB – Initialize a new run
wandb.init(project="pytorch-intro")
#wandb.watch_called = False # Re-run the model without restarting the runtime, unnecessary after our next release

# WandB – Config is a variable that holds and saves hyperparameters and inputs
config = wandb.config          # Initialize config
config.batch_size = 4          # input batch size for training (default: 64)
config.test_batch_size = 10    # input batch size for testing (default: 1000)
config.epochs = 10             # number of epochs to train (default: 10)
config.lr = 0.1                # learning rate (default: 0.01)
config.momentum = 0.1          # SGD momentum (default: 0.5)
config.no_cuda = False         # disables CUDA training
config.seed = 42               # random seed (default: 42)
config.log_interval = 10       # how many batches to wait before logging training status

def main():
    use_cuda = not config.no_cuda and torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}
    
    # Set the random seeds we log into w&b
    random.seed(config.seed)       # python random seed
    torch.manual_seed(config.seed) # pytorch random seed
    np.random.seed(config.seed) # numpy random seed
    torch.backends.cudnn.deterministic = True

    # Load the dataset: We're training our CNN on CIFAR10 (https://www.cs.toronto.edu/~kriz/cifar.html)
    # First we define the transformations to apply to our images
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
    
    # Now we load our training and test datasets and apply the transformations defined above
    train_loader = torch.utils.data.DataLoader(datasets.CIFAR10(root='./data', train=True,
                                              download=True, transform=transform), batch_size=config.batch_size,
                                              shuffle=True, **kwargs)
    test_loader = torch.utils.data.DataLoader(datasets.CIFAR10(root='./data', train=False,
                                             download=True, transform=transform), batch_size=config.test_batch_size,
                                             shuffle=False, **kwargs)

    classes = ('plane', 'car', 'bird', 'cat',
               'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

    # Initialize our model, recursively go over all modules and convert their parameters and buffers to CUDA tensors (if device is set to cuda)
    model = Net().to(device)
    optimizer = optim.SGD(model.parameters(), lr=config.lr,
                          momentum=config.momentum)
    
    # WandB – wandb.watch() automatically fetches all layer dimensions, gradients, model parameters and logs them automatically to your dashboard.
    # Using log="all" also logs histograms of parameter values, in addition to gradients
    wandb.watch(model, log="all")

    for epoch in range(1, config.epochs + 1):
        train(config, model, device, train_loader, optimizer, epoch)
        test(config, model, device, test_loader, classes)
        
    # WandB – Save the model checkpoint. This automatically saves a file to the cloud and associates it with the current run.
    torch.save(model.state_dict(), "model.h5")
    wandb.save('model.h5')

if __name__ == '__main__':
    main()

Files already downloaded and verified
Files already downloaded and verified




#### See Live Results
Check out your project page on W&B!

![project page](https://raw.githubusercontent.com/lucmos/DLAI-s2-2020-tutorials/master/07/pics/wandb_live.png)


#### Visualize Gradients
Click through to a single run to see more details about that run. For example, you can see the gradients that have been logged.

![gradients](https://raw.githubusercontent.com/lucmos/DLAI-s2-2020-tutorials/master/07/pics/wandb_grad.png)


#### Visualize Predictions
You can visualize the predictions made at every step by clicking on the Media tab. Here we can see an example of true labels and predictions made by our model on the CIFAR dataset.

![predictions](https://raw.githubusercontent.com/lucmos/DLAI-s2-2020-tutorials/master/07/pics/wandb_pred.png)


#### Review Code
The overview tab picks up a link to the code. In this case, it's a link to the Google Colab. If you're running a script from a git repo, we'll pick up the SHA of the latest git commit and give you a link to that version of the code in your own GitHub repo.

![overview](https://raw.githubusercontent.com/lucmos/DLAI-s2-2020-tutorials/master/07/pics/wandb_code.png)

#### Visualize Relationships
Use a parallel coordinates chart to see the relationship between hyperparameters and output metrics. Here, I'm looking at how the learning rate and other metrics I saved in "config" affect my loss and accuracy.

![parallel coordinates plot](https://raw.githubusercontent.com/lucmos/DLAI-s2-2020-tutorials/master/07/pics/wandb_rel.gif)

#### More about Weights & Biases

Here are some more resources:

1. [Documentation](http://docs.wandb.com) - Python docs
2. [Gallery](https://app.wandb.ai/gallery) - example reports in W&B
3. [Articles](https://www.wandb.com/articles) - blog posts and tutorials

---

Section adapted from this [tutorial](https://www.wandb.com/articles/intro-to-pytorch-with-wandb)

## Data and model exploration

Being able to interactively visualize the dataset and the model's output is very important for grasping what is happening.

There are many tools that you can use to play interactively with your code.
Two of them are:

**[Jupyter](https://jupyter.org/) notebooks or similar** (e.g. Google [Colaboratory](https://colab.research.google.com/))

At this point you should have a good understanding of how notebooks can be useful to interactively explore the data and models as you change hyperparameters.

**[Streamlit](https://www.streamlit.io/)**

Streamlit lets you create apps (that can be easily deployed as [awesome](https://awesome-streamlit.org/) web apps) for your machine learning projects with deceptively simple Python scripts. It supports hot-reloading, so your app updates live as you edit and save your file. No need to mess with HTTP requests, HTML, JavaScript, etc. All you need is your favorite editor and a browser.

This is the code for a minimal interactive app:

```python
import streamlit as st

x = st.slider('Select a value')
st.write(x, 'squared is', x * x)
```

Then, use `streamlit run <filename>` to launch the app:

![](https://raw.githubusercontent.com/lucmos/DLAI-s2-2020-tutorials/master/07/pics/streamlit_min.png)


Streamlit supports several popular data charting libraries like Matplotlib, Plotly, GraphViz, Altair, Deck.Gl, and [more](https://docs.streamlit.io/api.html#display-charts).

Despite its simplicity (no callbacks, every time a widget is touched the whole file is re-executed!) Streamlit lets you build incredibly rich and powerful tools:

![](https://raw.githubusercontent.com/lucmos/DLAI-s2-2020-tutorials/master/07/pics/streamlit1.gif)
![](https://raw.githubusercontent.com/lucmos/DLAI-s2-2020-tutorials/master/07/pics/streamlit2.gif)


## Generic tools and code best practices

### Be system independent

You want your code to be system independent. 
There are many common problems that arise if you fail to do so:

- If you change system or change the configuration of your computer (e.g. the username), the project won't work anymore and you will have to skim through the whole codebase to change all the system-specific constants.

- This becomes of the utmost importance when working in a team and using git. If you put system-specific constants (e.g. file system paths) in a file tracked by git, each member of the team will modify this file to match their own system, which causes never-ending conflicts.

- If you use system-specific constants in configuration files, e.g. the path to a dataset, that configuration file will not be usable on another system to reproduce the experiment. You should put the *name* of the dataset in the configuration file, not its path. Then, this *name* should be an environment variable that maps to the path on the current system.

The best practice to manage system-specific constants is to use **environment variables**.

Define all the system-specific constants in a file `.env` in the root of your project, using the bash syntax (note that there must be no spaces around `=`):
```bash
export VAR1=value1
export VAR2=value2
```
Put the `.env` file in the `.gitignore`. When changing system you will only need to change this non git-tracked file. Optionally you can track in git a `.env.template` to remember which variables are needed.

Once you have this file, it is possible to read the variables from everywhere:

- Bash scripts: source it `. .env`
- Makefile: include it `include .env`
- Python: use [`python-dotenv`](https://pypi.org/project/python-dotenv/)
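
To make the Python case concrete, here is a minimal standard-library sketch of what loading a `.env` file amounts to. In a real project you would more likely call `load_dotenv()` from `python-dotenv`, which handles quoting and edge cases more robustly; `load_env_file` and the variable name `DATASET_ROOT` are hypothetical, introduced only for this example.

```python
import os
import tempfile

def load_env_file(path):
    """Parse `export KEY=value` / `KEY=value` lines into os.environ (illustrative helper)."""
    loaded = {}
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            if line.startswith("export "):
                line = line[len("export "):]
            key, sep, value = line.partition("=")
            if not sep:
                continue  # not an assignment
            loaded[key.strip()] = value.strip().strip('"').strip("'")
    os.environ.update(loaded)
    return loaded

# Demo with a throwaway file; in a project the file would live in the repository root.
with tempfile.TemporaryDirectory() as tmp:
    env_path = os.path.join(tmp, ".env")
    with open(env_path, "w") as f:
        f.write("export DATASET_ROOT=/data/cifar10\n")
    load_env_file(env_path)

# The code itself only refers to the variable name, never to a hard-coded path.
dataset_root = os.environ["DATASET_ROOT"]
print(dataset_root)
```

The rest of the codebase then reads `os.environ["DATASET_ROOT"]`, so moving to another machine only requires editing the untracked `.env` file.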

Environment variables are very common, and most tools support them. For example, to use them in a Caddyfile just source `.env` before running [caddy](https://caddyserver.com/), then use the syntax `{$VAR1}` in the Caddyfile to access the variables.

---

The next level is to make your entire environment system-independent, not only your code (e.g. without manually installing all the dependencies). 
You can do this using [Docker](https://www.docker.com/).
Keep in mind that even if you use Docker, it is a good idea to write system-independent code. What if someone in your team can't use Docker?

### Git

![](https://raw.githubusercontent.com/lucmos/DLAI-s2-2020-tutorials/master/07/pics/xgit.png)

As with many great things in life, Git began with a bit of creative destruction and fiery controversy.

The Linux kernel is an open source software project of fairly large scope. For most of the lifetime of the Linux kernel maintenance (1991–2002), changes to the software were passed around as patches and archived files. In 2002, the Linux kernel project began using a proprietary DVCS called BitKeeper.

In 2005, the relationship between the community that developed the Linux kernel and the commercial company that developed BitKeeper broke down, and the tool’s free-of-charge status was revoked. This prompted the Linux development community (and in particular Linus Torvalds, the creator of Linux) to develop their own tool based on some of the lessons they learned while using BitKeeper. Some of the goals of the new system were as follows:

- Speed
- Simple design
- Strong support for non-linear development (thousands of parallel branches)
- Fully distributed
- Able to handle large projects like the Linux kernel efficiently (speed and data size)

Since its birth in 2005, Git has evolved and matured to be easy to use and yet retain these initial qualities. It’s amazingly fast, it’s very efficient with large projects, and it has an incredible branching system for non-linear development (See [Git Branching](https://git-scm.com/book/en/v2/Git-Branching-Branches-in-a-Nutshell#ch03-git-branching)).

---

If you have never used git or you want to make better use of it, I **highly recommend** reading (at least) the first three chapters of this [book](https://git-scm.com/book/en/v2).

Some other useful resources:
- [Learn Git](https://www.atlassian.com/git/tutorials/what-is-version-control) by Atlassian is a great resource, alternative to the book.
- [Here](https://try.github.io/) you can find great visualization tools to better understand the git tree.
- [This](https://gitexplorer.com/) is a tool to find the right commands you need without digging through the web.

#### **Quickstart to Git**

If you already know the basics of Git or made the **wise decision to read the [book](https://git-scm.com/book/en/v2)** or the Atlassian [tutorials](https://www.atlassian.com/git/tutorials/what-is-version-control), you can entirely skip this section.

##### **What's a version control system?**
Version control systems are a category of software tools that help a software team manage changes to source code over time. Version control software keeps track of every modification to the code in a special kind of database. If a mistake is made, developers can turn back the clock and compare earlier versions of the code to help fix the mistake while minimizing disruption to all team members.

##### **Benefits of version control systems**

Developing software without using version control is risky, like not having backups. Version control can also enable developers to move faster and it allows software teams to preserve efficiency and agility as the team scales to include more developers.

- A complete long-term change history of every file. Having the complete history enables going back to previous versions to help in root cause analysis for bugs and it is crucial when needing to fix problems in older versions of software.
- Branching and merging. Creating a "branch" in VCS tools keeps multiple streams of work independent from each other while also providing the facility to merge that work back together, enabling developers to verify that the changes on each branch do not conflict. 
- Traceability. Being able to trace each change made to the software and connect it to project management and bug tracking software such as Jira. (not so important in research).



##### **Snapshots, Not Differences**

The major difference between Git and any other VCS (Subversion and friends included) is the way Git thinks about its data. Conceptually, most other systems store information as a list of file-based changes. These other systems (CVS, Subversion, Perforce, Bazaar, and so on) think of the information they store as a set of files and the changes made to each file over time (this is commonly described as delta-based version control).

![](https://raw.githubusercontent.com/lucmos/DLAI-s2-2020-tutorials/master/07/pics/gitdeltas.png)

Git doesn’t think of or store its data this way. Instead, Git thinks of its data more like a series of snapshots of a miniature filesystem. With Git, every time you commit, or save the state of your project, Git basically takes a picture of what all your files look like at that moment and stores a reference to that snapshot. To be efficient, if files have not changed, Git doesn’t store the file again, just a link to the previous identical file it has already stored. Git thinks about its data more like a stream of snapshots.

![](https://raw.githubusercontent.com/lucmos/DLAI-s2-2020-tutorials/master/07/pics/gitsnapshots.png)

This is an important distinction between Git and nearly all other VCSs. It makes Git reconsider almost every aspect of version control that most other systems copied from the previous generation. This makes Git more like a mini filesystem with some incredibly powerful tools built on top of it, rather than simply a VCS.
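
The snapshot idea can be illustrated with a toy content-addressed store. This is a simplified sketch, not Git's actual implementation: real Git adds a header before hashing, compresses objects and organizes snapshots as tree objects. The class name `ToyRepo` is made up for this example.

```python
import hashlib

class ToyRepo:
    """Toy content-addressed store: objects keyed by the SHA-1 of their content."""

    def __init__(self):
        self.objects = {}    # hash -> file content
        self.snapshots = []  # each snapshot maps filename -> hash

    def commit(self, files):
        snapshot = {}
        for name, content in files.items():
            digest = hashlib.sha1(content.encode()).hexdigest()
            self.objects.setdefault(digest, content)  # stored once per distinct content
            snapshot[name] = digest
        self.snapshots.append(snapshot)
        return snapshot

repo = ToyRepo()
first = repo.commit({"a.txt": "hello", "b.txt": "world"})
second = repo.commit({"a.txt": "hello", "b.txt": "world!"})  # only b.txt changed

print(len(repo.objects))                   # number of distinct contents stored
print(first["a.txt"] == second["a.txt"])   # the unchanged file points to the same object
```

Both snapshots list every file, but the unchanged `a.txt` is stored only once and merely referenced again, as described above.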

##### **Nearly Every Operation Is Local**

Most operations in Git need only local files and resources to operate — generally no information is needed from another computer on your network. If you’re used to a CVCS where most operations have that network latency overhead, this aspect of Git will make you think that the gods of speed have blessed Git with unworldly powers. Because you have the entire history of the project right there on your local disk, most operations seem almost instantaneous.

For example, to browse the history of the project, Git doesn’t need to go out to the server to get the history and display it for you — it simply reads it directly from your local database. This means you see the project history almost instantly. If you want to see the changes introduced between the current version of a file and the file a month ago, Git can look up the file a month ago and do a local difference calculation, instead of having to either ask a remote server to do it or pull an older version of the file from the remote server to do it locally.

This also means that there is very little you can’t do if you’re offline or off VPN. If you get on an airplane or a train and want to do a little work, you can commit happily (to your local copy, remember?) until you get to a network connection to upload. If you go home and can’t get your VPN client working properly, you can still work. In many other systems, doing so is either impossible or painful. In Perforce, for example, you can’t do much when you aren’t connected to the server; in Subversion and CVS, you can edit files, but you can’t commit changes to your database (because your database is offline). This may not seem like a huge deal, but you may be surprised what a big difference it can make.


##### **Git Has Integrity**

Everything in Git is checksummed before it is stored and is then referred to by that checksum. This means it’s impossible to change the contents of any file or directory without Git knowing about it. This functionality is built into Git at the lowest levels and is integral to its philosophy. You can’t lose information in transit or get file corruption without Git being able to detect it.

The mechanism that Git uses for this checksumming is called a SHA-1 hash. This is a 40-character string composed of hexadecimal characters (0–9 and a–f) and calculated based on the contents of a file or directory structure in Git. A SHA-1 hash looks something like this:

`24b9da6552252987aa493b52f8696cd6d3b00373`

You will see these hash values all over the place in Git because it uses them so much. In fact, Git stores everything in its database not by file name but by the hash value of its contents.
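
As a rough illustration of how such a hash is obtained for a file, the sketch below reproduces the scheme Git uses for file contents ("blob" objects): the SHA-1 is computed over a short header followed by the raw bytes. The function name `git_blob_hash` is an assumption of this example.

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    """SHA-1 that Git assigns to a file's content (a "blob" object)."""
    # The header is the object type, a space, the content length in bytes, and a NUL byte.
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()

digest = git_blob_hash(b"test content\n")
print(digest, len(digest))
```

You can compare the result with the output of `git hash-object <file>` on a file with the same content.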


##### **Git Generally Only Adds Data**

When you do actions in Git, nearly all of them only add data to the Git database. It is hard to get the system to do anything that is not undoable or to make it erase data in any way. As with any VCS, you can lose or mess up changes you haven’t committed yet, but after you commit a snapshot into Git, it is very difficult to lose, especially if you regularly push your database to another repository.

This makes using Git a joy because we know we can experiment without the danger of severely screwing things up.

##### **The Three States**

Pay attention now — here is the main thing to remember about Git if you want the rest of your learning process to go smoothly. Git has three main states that your files can reside in: modified, staged, and committed:

- Modified means that you have changed the file but have not committed it to your database yet.

- Staged means that you have marked a modified file in its current version to go into your next commit snapshot.

- Committed means that the data is safely stored in your local database.

This leads us to the three main sections of a Git project: the working tree, the staging area, and the Git directory.

![](https://raw.githubusercontent.com/lucmos/DLAI-s2-2020-tutorials/master/07/pics/gitareas.png)


The working tree is a single checkout of one version of the project. These files are pulled out of the compressed database in the Git directory and placed on disk for you to use or modify.

The staging area is a file, generally contained in your Git directory, that stores information about what will go into your next commit. Its technical name in Git parlance is the “index”, but the phrase “staging area” works just as well.

The Git directory is where Git stores the metadata and object database for your project. This is the most important part of Git, and it is what is copied when you clone a repository from another computer.

The basic Git workflow goes something like this:

- You modify files in your working tree.

- You selectively stage just those changes you want to be part of your next commit, which adds only those changes to the staging area.

- You do a commit, which takes the files as they are in the staging area and stores that snapshot permanently to your Git directory.

If a particular version of a file is in the Git directory, it’s considered committed. If it has been modified and was added to the staging area, it is staged. And if it was changed since it was checked out but has not been staged, it is modified. 
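
The rule in the last paragraph can be written down as a toy model, where each file is represented by its content in the three sections. This is a simplification for illustration (the function `file_states` is hypothetical and ignores untracked or deleted files); note that a file can be staged and modified at the same time if you edit it again after staging.

```python
def file_states(head, index, worktree):
    """Return the states of one file given its content in the Git directory
    (last commit), the staging area (index) and the working tree."""
    states = set()
    if index != head:
        states.add("staged")      # the index holds a version not yet committed
    if worktree != index:
        states.add("modified")    # the working tree differs from what is staged
    if not states:
        states.add("committed")   # all three copies agree
    return states

print(file_states("v1", "v1", "v1"))  # unchanged since the last commit
print(file_states("v1", "v1", "v2"))  # edited, not yet staged
print(file_states("v1", "v2", "v2"))  # edited and staged
print(file_states("v1", "v2", "v3"))  # staged, then edited again
```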

##### **Branches in a Nutshell**
When you make a commit, Git stores a commit object that contains a pointer to the snapshot of the content you staged. This object also contains the author’s name and email, the message that you typed, and pointers to the commit or commits that directly came before this commit (its parent or parents): zero parents for the initial commit, one parent for a normal commit, and multiple parents for a commit that results from a merge of two or more branches.

A branch in Git is simply a lightweight movable pointer to one of these commits. The default branch name in Git is master (recent Git versions let you change it with the `init.defaultBranch` setting). As you start making commits, you’re given a master branch that points to the last commit you made. Every time you commit, it moves forward automatically.

How does Git know what branch you’re currently on? It keeps a special pointer called HEAD. 

![](https://raw.githubusercontent.com/lucmos/DLAI-s2-2020-tutorials/master/07/pics/gitbranches.png)

Because a branch in Git is in actuality a simple file that contains the 40 character SHA-1 checksum of the commit it points to, branches are cheap to create and destroy. Creating a new branch is as quick and simple as writing 41 bytes to a file (40 characters and a newline).
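The 41 bytes can be checked with a small sketch. It writes a toy branch file and a toy `HEAD` file into a temporary directory; the directory layout imitates the `refs/heads/` folder of a Git directory, but no real repository is created and the commit id is made up.

```python
import hashlib
import os
import tempfile

# A made-up commit id: any SHA-1 digest is 40 hexadecimal characters long.
commit_id = hashlib.sha1(b"toy commit").hexdigest()

with tempfile.TemporaryDirectory() as repo:
    heads = os.path.join(repo, "refs", "heads")
    os.makedirs(heads)

    # "Creating a branch": write the commit id followed by a newline.
    branch_file = os.path.join(heads, "your-feature-branch")
    with open(branch_file, "w", newline="\n") as f:
        f.write(commit_id + "\n")
    branch_size = os.path.getsize(branch_file)

    # HEAD records which branch is currently checked out.
    head_file = os.path.join(repo, "HEAD")
    with open(head_file, "w", newline="\n") as f:
        f.write("ref: refs/heads/your-feature-branch\n")
    with open(head_file) as f:
        current_branch = f.read().strip().split("refs/heads/")[-1]

print(branch_size, current_branch)
```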

Read more on branches [here](https://git-scm.com/book/it/v2/Git-Branching-Branches-in-a-Nutshell).

##### **Basic Git commands**


Command  | Description
------------- | -------------
 |
`git init <directory>` |  Create empty Git repo in specified directory. <br> Run with no arguments to initialize the current directory as a git repository
 | 
`git config user.name <name>` | Define author name to be used for all commits in current repo. <br> Devs commonly use `--global` flag to set config options for current user.
 |
`git clone <repo>` | Clone repo located at `<repo>` onto local machine. <br> Original repo can be located on the local filesystem or on a remote machine via HTTP or SSH. 
 | 
`git add <directory>`  | Stage all changes in `<directory>` for the next commit. Replace `<directory>` with a `<file>` to stage a specific file
 |
`git commit -m "<message>" ` | Commit the staged snapshot, but instead of launching a text editor, use `<message>` as the commit message
 |
`git status` | List which files are staged, unstaged, and untracked.
 |
`git log` | Display the entire commit history using the default format. For customization see additional options.
 |
`git diff` | Show unstaged changes between your index and working directory
 |
`git branch` | List all of the branches in your repo. Add a `<branch>` argument to create a new branch with the name `<branch>`.
 |
`git checkout -b <branch>` | Create and check out a new branch named `<branch>`. Drop the -b flag to checkout an existing branch.
 |
`git merge <branch>` |  Merge `<branch>` into the current branch.
 |
`git rebase <base>` | Rebase the current branch onto `<base>`. `<base>` can be a commit ID, branch name, a tag, or a relative reference to HEAD.
 |
`git rebase -i <base>` | Interactively rebase current branch onto `<base>`. <br>Launches editor to enter commands for how each commit will be transferred to the new base.
 |
`git remote add <name> <url>` | Create a new connection to a remote repo. <br> After adding a remote, you can use `<name>` as a shortcut for `<url>` in other commands.
 | 
`git fetch <remote> <branch>` | Fetch a specific `<branch>` from `<remote>`. Leave off `<branch>` to fetch all remote refs.
 |
`git pull <remote>` | Fetch the specified remote’s copy of current branch and immediately merge it into the local copy.
 |
`git push <remote> <branch>` | Push the branch to `<remote>`, along with necessary commits and objects. <br> Creates named branch in the remote repo if it doesn’t exist.
 |

More on Merging vs Rebasing [here](https://www.atlassian.com/git/tutorials/merging-vs-rebasing).

---

Section adapted from the [Pro Git book](https://git-scm.com/book/en/v2) and the Atlassian [tutorials](https://www.atlassian.com/git/tutorials/what-is-version-control).





#### Git Workflows

There are many possible workflows that can be used to organize the project lifecycle.

- [Centralized Workflow](https://www.atlassian.com/git/tutorials/comparing-workflows#centralized-workflow)
- [Feature Branch Workflow](https://www.atlassian.com/git/tutorials/comparing-workflows/feature-branch-workflow)
- [Gitflow Workflow](https://www.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow)
- [GitHub flow](https://guides.github.com/introduction/flow/)
- [Forking Workflow](https://www.atlassian.com/git/tutorials/comparing-workflows/forking-workflow)

Usually for research projects there is no need for complex workflows that scale up to big projects and teams with strict deployment policies.

One simple workflow is the [*squash-rebase workflow*](https://blog.carbonfive.com/always-squash-and-rebase-your-git-commits/).

##### **Squash-rebase workflow**

It’s simple: before you merge a feature branch back into your main branch (often `master` or `develop`), your feature branch should be squashed down to a single buildable commit, and then rebased onto the up-to-date main branch. Here’s a breakdown.

###### **Create a new branch**

Work on new features on new branches:

```bash
# Update your local master branch
git checkout master
git pull

# Create the feature branch and switch to it
git branch "your-feature-branch"
git checkout "your-feature-branch"

# Make changes as needed, with as many commits as you need.
# Make sure the final commit is buildable and all tests pass.
...
git add .
git commit -m "My new awesome feature"
```

One example of a feature branch may be the implementation of the early stopping mechanism. If it takes longer than you expected, you can always stop working on that feature and continue later.

Your main branch is unaffected by the changes!

###### **Squash the commits**

When the feature is ready, squash or reorganize the commits in this branch in a sensible way.

You may proceed in three different ways:

**1**) Count the commits to squash.

Get the number of commits made since the start of your branch. There are a couple of ways to do this: you can simply run `git log` and count your commits, or run `git log --graph --decorate --pretty=oneline --abbrev-commit`, which shows a graph of your commit history and may make the commits easier to count.

Squash to one (or few important) commits with:
```bash
git rebase -i HEAD~[NUMBER OF COMMITS]
```

---

**2**) Use the commit SHA 

Sometimes you will have so many commits that counting them becomes troublesome. In that case, grab the SHA of the commit your branch branches off from, i.e. the last commit it shares with the main branch.


Squash to one (or few important) commits with:
```bash
git rebase -i [SHA]
```
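To make options 1 and 2 concrete, the sketch below finds the fork point and the number of commits to squash on a toy history. The commit names (`m1`, `f1`, ...) and the helpers `ancestors` and `fork_point` are hypothetical, and the history is assumed to be linear with a single parent per commit. On a real repository, `git merge-base master your-feature-branch` prints the SHA of the shared commit.

```python
from typing import Dict, List, Optional

# child -> parent, for a toy history with made-up commit ids
parents: Dict[str, Optional[str]] = {
    "m1": None, "m2": "m1", "m3": "m2",  # master: m1 <- m2 <- m3
    "f1": "m2", "f2": "f1", "f3": "f2",  # feature branch forked from m2
}


def ancestors(tip: str) -> List[str]:
    """Return the commits reachable from `tip`, newest first (tip included)."""
    chain: List[str] = []
    current: Optional[str] = tip
    while current is not None:
        chain.append(current)
        current = parents[current]
    return chain


def fork_point(feature_tip: str, main_tip: str) -> str:
    """Return the most recent commit shared by the two branches."""
    main_history = set(ancestors(main_tip))
    for commit in ancestors(feature_tip):
        if commit in main_history:
            return commit
    raise ValueError("the two branches share no history")


base = fork_point("f3", "m3")  # the value to pass as [SHA]
commits_to_squash = ancestors("f3").index(base)  # the value for HEAD~[NUMBER OF COMMITS]
print(base, commits_to_squash)
```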

---

**3**) Interactive rebase local changes

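If you only want to squash the commits you have not pushed yet, you can run an interactive rebase without arguments. Git then uses the upstream configured for the current branch as the base, so the branch must track an upstream branch:
```bash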
git rebase -i
```

---

If you have previously pushed your code to a remote branch, you will need to force push with `git push origin branchName --force`. This is acceptable only as long as you are the only one working on your feature branch. Never force push branches shared with other people!

###### **Rebase from master**

Rebase onto master, i.e. re-apply the changes of your branch on top of the current state of the master branch:
```bash
git checkout master; git pull;   # pull new master changes
git checkout "your-feature-branch"
git rebase master
```

Handle any conflicts and make sure your code builds and all tests pass. Then force push the branch to the remote:

`git push origin branchName --force`

###### **Merge to master**

Merge your rebased branch into master and push:

```bash
git checkout master
git merge "your-feature-branch"
git push
```
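The whole squash, rebase and merge sequence can be pictured on a toy history. In this sketch a branch is assumed to be just a list of commit messages, and `squash` and `rebase` are hypothetical helpers. File contents and conflicts are ignored, so it only shows how the shape of the history changes.

```python
from typing import List

# Both branches start from the same two commits (the fork point).
master: List[str] = ["init", "add dataset"]
fork_length = len(master)
feature: List[str] = master + ["wip early stopping", "fix typo", "early stopping works"]

# Meanwhile somebody else pushes a new commit to master.
master = master + ["update README"]


def squash(branch: List[str], base_length: int, message: str) -> List[str]:
    """Collapse every commit after the first `base_length` ones into one."""
    return branch[:base_length] + [message]


def rebase(branch: List[str], base_length: int, new_base: List[str]) -> List[str]:
    """Replay the commits made after the fork point on top of `new_base`."""
    return new_base + branch[base_length:]


feature = squash(feature, fork_length, "Add early stopping")
feature = rebase(feature, fork_length, master)

# After the rebase master is a prefix of the feature branch, so the merge
# is a fast-forward: master simply moves to the tip of the feature branch.
if feature[: len(master)] == master:
    master = list(feature)

print(master)
```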

###### **Pros**

- All commits in master build and pass tests. This has several advantages, e.g. you can use `git bisect` to understand which commit introduced a new bug.

- It makes handling conflicts from rebasing simple. Since you’ve squashed down to one commit, you only have to deal with those conflicts once, rather than having to work against half-baked code. It reduces the risk of losing code when dealing with the conflicts.

- Cleaner `git log`

![](https://raw.githubusercontent.com/lucmos/DLAI-s2-2020-tutorials/master/07/pics/messylog.png)
![](https://raw.githubusercontent.com/lucmos/DLAI-s2-2020-tutorials/master/07/pics/cleanlog.png)
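The `git bisect` advantage mentioned in the first point relies on a binary search over the history, which only works if every commit can be built and tested. A minimal sketch of that idea follows, assuming a hypothetical list of commits ordered from oldest to newest, a known good first commit, a known bad last commit, and a bug that persists once introduced. `first_bad_commit` and `is_bad` are illustrative names, not part of Git.

```python
from typing import Callable, List, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Binary search for the commit that introduced a bug.

    Assumes commits[0] is good, commits[-1] is bad and the bug never disappears.
    """
    low, high = 0, len(commits) - 1  # invariant: commits[low] good, commits[high] bad
    while high - low > 1:
        mid = (low + high) // 2
        if is_bad(commits[mid]):
            high = mid
        else:
            low = mid
    return commits[high]


history = [f"c{i}" for i in range(1, 17)]  # c1 ... c16, made-up commit names
tested: List[str] = []


def is_bad(commit: str) -> bool:
    """Stand-in for 'check out this commit and run the tests'."""
    tested.append(commit)
    return int(commit[1:]) >= 11  # in this toy history the bug appears in c11


culprit = first_bad_commit(history, is_bad)
print(culprit, "found after testing", len(tested), "commits")
```

If some commits in the middle of the history did not build, `is_bad` could not give an answer for them, which is why keeping master buildable matters here.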

###### **Cons**

- We are rewriting history, which is generally considered bad practice (especially on shared branches)
- We lose the granularity of the individual commits
- It is difficult to work as a team on a single feature
- `git rebase` may be dangerous if you don't know what you're doing

---

Section adapted from this [blog post](https://blog.carbonfive.com/always-squash-and-rebase-your-git-commits/).



### Type annotations

Type annotations are a "new" feature, introduced by [PEP 484](https://www.python.org/dev/peps/pep-0484/), that allows adding *type hints* to variables or functions. They are used to inform someone reading the code what the type of a variable or function *should be*.

```python
# Function that takes a str, a bool and returns a str.
def greeting(name: str, excited: bool = False) -> str:
    message = 'Hello, {}'.format(name)
    if excited:
        message += '!!!'
    return message
```

#### Type annotations are optional

1. You can’t break the code by adding them
1. They provide no performance gain (not [yet](https://github.com/python/mypy/blob/master/mypyc/doc/getting_started.rst))! 
1. You may add them only where you see fit
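A small sketch of what "optional" means in practice: the interpreter stores the annotations but does not enforce them at runtime. The function `scale` is a made-up example.

```python
from typing import get_type_hints


def scale(value: float, factor: float = 2.0) -> float:
    return value * factor


# The hints are stored on the function and can be inspected...
hints = get_type_hints(scale)
print(hints)

# ...but nothing checks them at runtime: a str and an int are accepted,
# and `*` simply repeats the string.
repeated = scale("ab", 2)
print(repeated)
```

A static checker would be expected to flag the last call, while plain Python runs it without complaint.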

#### Some benefits of type annotations

1. We can employ **static code analysis** to catch type errors prior to runtime (with `mypy` or with IDE integration, e.g. PyCharm defaults to highlighting type errors)

1. **Cleaner code/the code is self-documenting**: “don’t use comments to specify a type, when you can use type annotation”. Comments tend to wear out and rot, while code is alive and must stay fresh. If you change the type of a variable but forget to update a comment that describes it, nothing happens. If you forget to update the type annotation, your static analysis tool, whichever it may be, will shout at you.

1. **Better code completion** — since IDEs are informed about the type of each variable.
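As an illustration of the first benefit, the sketch below contains a bug that a static checker such as `mypy` is expected to report before the code runs, because `find_lr` may return `None`. All function names and the default value are hypothetical.

```python
from typing import Mapping, Optional


def find_lr(config: Mapping[str, float], key: str) -> Optional[float]:
    """Return the configured value, or None if the key is missing."""
    return config.get(key)


def halved_lr_unsafe(config: Mapping[str, float], key: str) -> float:
    # A type checker is expected to complain here: the result of find_lr
    # may be None, and None cannot be divided.
    return find_lr(config, key) / 2


def halved_lr(config: Mapping[str, float], key: str, default: float = 1.0) -> float:
    lr = find_lr(config, key)
    if lr is None:  # the explicit check narrows Optional[float] to float
        lr = default
    return lr / 2


print(halved_lr({"lr": 0.5}, "lr"))
print(halved_lr({}, "lr"))
```

Without annotations the unsafe version would only fail at runtime, and only when the key is actually missing.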

---

Read more in the [PEP 484](https://www.python.org/dev/peps/pep-0484/) and in the [typing](https://docs.python.org/3/library/typing.html) module.

> Curiosity: read about [Dropbox's journey to type checking 4 million lines of Python](https://dropbox.tech/application/our-journey-to-type-checking-4-million-lines-of-python)


### Code autoformatter

Use a code autoformatter that runs on save.

One great example is the [Black](https://black.readthedocs.io/en/stable/) autoformatter, which formats the code for you (e.g. every time you save the file) and can be integrated into most IDEs.

> Black is the uncompromising Python code formatter. 
> By using it, you agree to **cede control over minutiae of hand-formatting**. 
>
> In return, Black gives you speed, determinism, and freedom from *pycodestyle* nagging about formatting. 
> You will **save time and mental energy for more important matters**.
>
> Blackened code looks the same regardless of the project you're reading. Formatting becomes transparent after a while and you can **focus on the content instead**.
>
> Black makes code review faster by producing the **smallest diffs possible**.
>
> Watch the [PyCon 2019](https://www.youtube.com/watch?v=esZLCuWs_2Y&feature=youtu.be) talk to learn more.

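An autoformatter is safe to run on every save because it changes only the layout of the code, not the program. The sketch below illustrates this with the standard `ast` module. The two snippets are written by hand for the example, and the second one is not claimed to be Black's exact output. `same_program` is a hypothetical helper.

```python
import ast

hand_formatted = '''
def make_optimizer( params,lr = 1e-3,
        momentum=0.9 , nesterov = True ):
    return {'params':params,'lr':lr,
      'momentum':momentum,'nesterov':nesterov}
'''

consistently_formatted = '''
def make_optimizer(params, lr=1e-3, momentum=0.9, nesterov=True):
    return {"params": params, "lr": lr, "momentum": momentum, "nesterov": nesterov}
'''


def same_program(source_a: str, source_b: str) -> bool:
    """Compare two sources by their syntax trees, ignoring layout and quote style."""
    return ast.dump(ast.parse(source_a)) == ast.dump(ast.parse(source_b))


# Whitespace, line breaks and quote style do not change the syntax tree...
layout_only_change = same_program(hand_formatted, consistently_formatted)

# ...while changing a default value does.
behaviour_change = same_program(
    hand_formatted, consistently_formatted.replace("0.9", "0.8")
)

print(layout_only_change, behaviour_change)
```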