<h2>Purpose:</h2>
    <ul>
        <li>Loading data using Pytorch datasets</li>
          <dd>- Learning about DataSet and DataLoader</dd>
        <li>Building a feedforward Neural Network to classify images</li>
        <li>Understanding classification loss</li>
    </ul>

    
    
    

<h2>Overview:</h2>
<ul>
    <li>This chapter starts with introducing module <i>torchvision.datasets</i>. This module provides several ready to use datasets, such as CIFAR10, ImageNet, MNIST, Omniglot, .... Visit <a url='https://pytorch.org/vision/stable/datasets.html'>here</a> to find the full list.</li>
    <li>Takes CIFAR dataset as an example to illustrate how perform preprocessing of an imaging data</li>
    <li> Then demonstrates how to build a Fully Connected Neural Network to classify images</li>
</ul>


<h3>CIFAR10 dataset:</h3>
<ul>
    <li>60,000 color (RGB) images of size 32*32</li>
    <li> 10 classes: airplane (0), automobile (1), bird (2), cat (3),
1 deer (4), dog (5), frog (6), horse (7), ship( 8), andtruck (9)</li>
    <li><span style="color:Tomato;"><b>Note:</b></span> Sample images in CIFAR10 dataset of Pytorch are instances of RGB PIL images </li>
</ul>
    

<h3>Dataset Transforms: torchvision.transforms</h3>
    <p> <b>torchvision.transfroms</b> module provides several composable image tranformations. 
    Transformations can be applied separately or can be chained using <b>torchvision.tranforms.Compose</b>. </p>
    <p>Compose accept a list of transformations and apply them in order.</p>

In [23]:
import torch
from torchvision.datasets import CIFAR10
import torchvision.transforms as transforms
cifar_mean = torch.tensor([0.4914, 0.4822, 0.4465])
cifar_std = torch.tensor([0.2470, 0.2435, 0.2616])
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize(mean=cifar_mean, std=cifar_std)])
cifar10 = CIFAR10('/Users/lidasafarnejad/PycharmProjects/pytorchBook/data/', train=True, download=False, transform=transform)
cifar10_val = CIFAR10('/Users/lidasafarnejad/PycharmProjects/pytorchBook/data/',train=False,download=False,transform=transform)


<h3>Explanation</h3>
<p>In the cell above, we construct to objects from the class CIFAR10 dataset; cifar10 to be used for training and cifar10_val to be used for validation</p>
<p>In the cell bellow, a smaller dataset, called cifar2, is extracted </p>

In [24]:
label_map = {0: 0, 2: 1}
class_names = ['airplane', 'bird']
cifar2 = [(img, label_map[label]) for img, label in cifar10 if label in [0, 2]]
cifar2_val = [(img, label_map[label]) for img, label in cifar10_val if label in [0, 2]]

<h2> Build a Classifier </h2>

<p>After preparing the dataset, cifar2 in this example, the book continues to build a simple classifier using the module <i><b>torch.nn</b></i></p>

<p>Here are some notes about this model: </p>
<ul>
    <li>This model is a Fully Connected Neural Network (FNN). This means that the model is a <span style='background-color:#FFFF00;'>sequence</span> of <span style='background-color:#FFFF00;'>linear layers</span> and of course <span style='background-color:#FFFF00;'>activation functions</span></li>
    <li>In Pytorch: </li>
    <ul>
        <li>sequentiality is implemented using <i><b>nn.Sequential</b></i> module</li>
        <li>linear layers are objects of <i><b>nn.Linear</b></i> class</li>
        <li>activation functions are objects of <i><b>n.Tanh</b></i> class</li>
        </ul>
    <li>Finally the output goes through <i><b>nn.LogSoftmax</b></i>  to output two values: 1) the probability of being of the class airplane, 2) The probability of being of the class bird (or vice versa). <span color='red'>Note</span> that we use LogSoftmax, thus; the output is the logarithm of the probabilities</li>
            <li>The model must be trained to maximize the probability of the correct class</li>
            <li>The model needs a loss function. MSE loss function is <span style='color:red;'>NOT</span> a good choice in this scenario because the model aims to predict probabilities of classes rather than predicting a continious value. </li>
            <li>The loss function must be high when it outputs a low probability for the correct class 
                and it must be low when it outputs a high probability for the correct class. 
                Negative Log Likelihood (NLL) is a good choice in this scenario. </li>
        </ul> 
    


The model then looks like this:

In [26]:
import torch.nn as nn
model = nn.Sequential( 
            nn.Linear(3072, 512),
            nn.Tanh(), nn.Linear(512, 2), 
            nn.LogSoftmax(dim=1))

In [33]:
loss_fn = nn.NLLLoss()

Now we can do <span style='background-color:#FFFF00;'>inference</span> for one sample image:

In [34]:
img, label = cifar2[0]
out = model(img.view(-1).unsqueeze(0)) 
loss_fn(out, torch.tensor([label]))

tensor(0.6549, grad_fn=<NllLossBackward0>)

<b>Note:</b> The combination of nn.LogSoftmax and  nn.NLLLoss is nn.CrossEntropyLoss; thus, the model can be changed as follows:

In [38]:
model = nn.Sequential( 
            nn.Linear(3072, 512),
            nn.Tanh(), nn.Linear(512, 2), 
            )
loss_fn = nn.CrossEntropyLoss()

<h2>Training a classifier</h2>

<p>The purpose of training is estimating the parameters of model that we have built. <span style='background-color:yellow;'>Gradient descent</span> is an iterative algorithm to estimate these parameters. In every iteration of the training process, parameters are adjusted such that the loss function is minimized. </p>


In [39]:
learning_rate = 1e-2
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
n_epochs = 1
for epoch in range(n_epochs):
    for img, label in cifar2:
        out = model(img.view(-1).unsqueeze(0)) 
        loss = loss_fn(out, torch.tensor([label]))
        optimizer.zero_grad() 
        loss.backward() 
        optimizer.step()
        print("Epoch: %d, Loss: %f" % (epoch, float(loss)))
        break

Epoch: 0, Loss: 0.960388


<h3>Batch Gradient Descent vs Minibatch Gradient Descent vs Stochastic Gradient Descent (SGD)</h3>
<p>In batch (or vanilla) gradient descent, in every epoch, the loss is computed for every single sample and accumulated before taking one gradient step. In other words, the loss is computed over the entire batch of samples. </p>
<p>In SGD, in every epoch, a gradient descent step is taken after computing loss for every single sample. In other words, parameters are updated after computing loss for every single sample</p>
<p>Mini-batch is kind of the middle between the above two extremes. In every epoch, the dataset is broken into smaller subsets or mini-batches of samples. Then, the gradient descent step is taken after computing loss over a mini-batch. </p>
<p>To further understand the difference between these approaches, you can watch <a url='https://www.youtube.com/watch?v=W9iWNJNFzQI'>this</a> and <a url='https://www.youtube.com/watch?v=4qJaSmvhxi8'>this</a> videos by Andrew Ng</p>    

<h2>DataLoader and Mini-Batch</h2>
<p>In the simple model that we built above, we used the class <i><b>torch.optim.SGD</b></i> class to apply the idea of SGD. To implement mini-batch gradient descent, we can use the class <i><b>torch.utils.data.DataLoader</b></i> </p>

In [41]:
train_loader = torch.utils.data.DataLoader(cifar2, batch_size=64, shuffle=True)

In [44]:
learning_rate = 1e-2
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
n_epochs = 1
for epoch in range(n_epochs):
    for imgs, labels in train_loader:
        batch_size = imgs.shape[0]
        outputs = model(imgs.view(batch_size, -1)) 
        loss = loss_fn(outputs, labels)
        optimizer.zero_grad() 
        loss.backward() 
        optimizer.step()

<span style='color:red;'></span>

<span style='color:red;'>Note:</span> In the previousy built model <b>WITHOUT</b> <i><b>DataLoader</b></i>, the optimizer torch.optim.SGD takes one gradient step and updates parameters over a single sample. By creating a DataLoader object and using it in the training loop, the DataLoader creates mini-batches of size 64 and the optimizer SGD knows that it must takes gradient steps and updates the parameters over the mini-batches of 64 samples.

<h2>Validation</h2>

<p>After training over the training dataset, we should validate the resulted model over another set of samples which has not been shown to the model during the training phase. This will let us know how well the model will be <b>generalized</b> to unseen samples.</p>

In [49]:
val_loader = torch.utils.data.DataLoader(cifar2, batch_size=64, shuffle=False)
correct = 0
total = 0

with torch.no_grad():
    for imgs, lables in val_loader:
        batch_size = imgs.shape[0]
        outputs = model(imgs.view(batch_size,-1))
        _, predicted = torch.max(outputs, dim=1)
        total += lables.shape[0]
        correct += int((predicted==lables).sum())

print(f'Accuracy: {correct / total}')


Accuracy: 0.7775


<p><b>But why validating the model?<b></p>

<p>In real scenarios, after training and when the model is ready, it is deployed on a server and afterwards, it will be used to perform inference for samples which might have never be seen during the training phase. If we train a model with high accuracy on the training set but it does poorly over other samples, the model will be useless!  </p>

<b>Notes:</b>
<ul>
    <li>Some folks consider <i>validation</i> and <i>test</i> set to be the same, that is a set of samples to be used for validating the performance of the trained model; However, others consider them not to be the same. In other words, they break down a set of samples into THREE subsets: training set; to be used for training, a validation set; to be used for choosig the best set of hyper parameters, a test set; to be used to see how well a model is generalized. What we considered as a validation set in the example model above equavalents to the test set by this definition.</li>
    <li>The distribution of the validation set, the test set, and the training set should be the same. You cannot train a model over samples of cats and then expect it to identify kittens with a high accuracy!</li>
</ul>

<h2>A deeper model with more parameters</h2>

<p>The chapter continues on creating a more complicated model with more hidden layers. It discusses that having a complicated model does not necessarily lead to better results, or higher accuracy in this context. To elaborate more, the model might shows high accuracy over the training set, but that might not be the case for the test set. This means that a model with a large set of parameters is capable of MEMORIZING every details of the training set or in other words we are OVERFITTING the model. </p>

<h3>How to determine the number of the parameters of a model in Pytorch?</h3>


In [51]:
numel_list = [p.numel() for p in model.parameters() if p.requires_grad == True]
sum(numel_list), numel_list

(1574402, [1572864, 512, 1024, 2])

<h2>The limits of fully connected networks</h2>
<p>FCNs in general are not good choices when dealing with imaging datasets. A fully connected network relates a pixel to all other pixels. However, pixels are generally mostly relevant to their neighboring ones. In addition, FCNs are translation invariant. That is if the model is trained to capture one feature in a region, it is not able to reconize it in another region. As the an FCNs focus on every single, it will lose the context at which a pixel appears. This context is defined by the neighbing pixels of that pixel.</p>


<h3>Solution and Motivation for Chapter 8</h3>

<p>Convolutional Neural Networks (CNNs) were proposed to overcome the shortcommings of FCNs when dealing with imaging data. In such networks, the model learns several kernels which are able to extract local features and characteristics from an image. </p>