<a href="https://colab.research.google.com/github/Ranjani94/Advanced_Deep_Learning/blob/master/Assignment_2/MixUp_CIFAR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##FastAI Mixup Augmentation, Label smoothing, Test Time Augmentation, Progressive Resizing

We start by installing and importing fastbook, the companion package for the fastai book notebooks, which cover an introduction to deep learning, fastai, and PyTorch. fastai is a layered API for deep learning.

In [1]:
!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()



In [2]:
from fastbook import *

###Using CIFAR 10 dataset for mixup augmentation

In [4]:
from fastai.vision.all import *
path = untar_data(URLs.CIFAR)

###Loading data with the fastai library

In [5]:
dblock = DataBlock(blocks=(ImageBlock(), CategoryBlock()),
                   get_items=get_image_files,
                   get_y=parent_label,
                   item_tfms=Resize(460),
                   batch_tfms=aug_transforms(size=224, min_scale=0.75))
dls = dblock.dataloaders(path, bs=64)

###Visualization

###Training on the CIFAR dataset with an XResNet50 trained from scratch, using cross-entropy loss

When working with models that are being trained from scratch, or fine-tuned to a very different dataset than the one used for the pretraining, there are some additional techniques that are really important. 

In [6]:
model = xresnet50()
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.fit_one_cycle(5, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.430894,1.906714,0.4015,11:16
1,1.033549,1.046034,0.629833,11:22
2,0.793301,0.84358,0.7115,11:22
3,0.620378,0.583736,0.796417,11:22
4,0.527582,0.512798,0.82425,11:22


###Normalization

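When training a model, it helps if the input data is normalized, that is, has a mean of 0 and a standard deviation of 1. Image pixel values, however, lie between 0 and 255 (or between 0 and 1 once converted to floats), so the data is not normalized by default. We can check this by grabbing a batch and computing the mean and standard deviation over every axis except the channel axis.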

In [9]:
x,y = dls.one_batch()
x.mean(dim=[0,2,3]),x.std(dim=[0,2,3])

(TensorImage([0.4828, 0.4759, 0.4275], device='cuda:0'),
 TensorImage([0.2182, 0.2161, 0.2406], device='cuda:0'))

The mean and standard deviation are not very close to the desired values. Fortunately, normalizing the data is easy to do in fastai by adding the Normalize transform.

In [10]:
def get_dls(bs, size):
    dblock = DataBlock(blocks=(ImageBlock, CategoryBlock),
                   get_items=get_image_files,
                   get_y=parent_label,
                   item_tfms=Resize(460),
                   batch_tfms=[*aug_transforms(size=size, min_scale=0.75),
                               Normalize.from_stats(*imagenet_stats)])
    return dblock.dataloaders(path, bs=bs)

In [11]:
dls = get_dls(64, 224)

In [12]:
x,y = dls.one_batch()
x.mean(dim=[0,2,3]),x.std(dim=[0,2,3])

(TensorImage([-0.0640, -0.0176,  0.0812], device='cuda:0'),
 TensorImage([1.0912, 1.0718, 1.1660], device='cuda:0'))
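
The Normalize transform subtracts a per-channel mean and divides by a per-channel standard deviation. The sketch below shows that arithmetic on plain Python lists for a single channel. The helper normalize_channel and the sample pixel values are made up for illustration; 0.485 and 0.229 are the red-channel mean and standard deviation from imagenet_stats. It also suggests one likely reason the statistics above are close to, but not exactly, 0 and 1: the transform uses ImageNet's statistics rather than this dataset's own.

```python
from statistics import fmean, pstdev

def normalize_channel(pixels, mean, std):
    """Shift and scale one channel: (x - mean) / std."""
    return [(p - mean) / std for p in pixels]

# Hypothetical pixel values for one channel, already scaled to [0, 1].
channel = [0.10, 0.35, 0.50, 0.65, 0.90]

# Normalizing with the channel's own statistics gives mean 0 and std 1.
own = normalize_channel(channel, fmean(channel), pstdev(channel))

# Normalizing with statistics borrowed from another dataset
# (ImageNet's red channel) only gets close to mean 0 and std 1.
borrowed = normalize_channel(channel, 0.485, 0.229)

print(fmean(own), pstdev(own))
print(fmean(borrowed), pstdev(borrowed))
```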

After normalization, the effect on model training is shown below:

In [13]:
model = xresnet50()
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.fit_one_cycle(5, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.432934,1.313992,0.527583,11:24
1,1.001459,0.868518,0.691,11:23
2,0.785417,0.691331,0.758167,11:23
3,0.575566,0.505399,0.8235,11:23
4,0.482419,0.444031,0.84475,11:22


Normalization becomes especially important when using pretrained models. The pretrained model only knows how to work with data of the type that it has seen before. If the average pixel value was 0 in the data it was trained with, but your data has 0 as the minimum possible value of a pixel, then the model is going to be seeing something very different to what is intended.

###Progressive Resizing

Start training with small images and end training with large images. Spending most of the epochs on small images helps training complete much faster, while finishing on large images makes the final accuracy much higher. This approach is called progressive resizing.

Progressive resizing is another form of data augmentation, so we expect models trained with it to generalize better. To implement it, it is most convenient to use a get_dls function that takes a batch size and an image size and returns the DataLoaders, as defined in the previous section. We can then create DataLoaders with a small image size and use fit_one_cycle in the usual way.

In [14]:

dls = get_dls(128, 128)
learn = Learner(dls, xresnet50(), loss_func=CrossEntropyLossFlat(), 
                metrics=accuracy)
learn.fit_one_cycle(4, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.31699,1.673635,0.483083,04:57
1,0.904789,0.945283,0.666,04:58
2,0.641992,0.566225,0.808417,04:58
3,0.488755,0.465645,0.839,04:57


We can then replace the DataLoaders inside the Learner with larger images and fine-tune:

In [15]:
learn.dls = get_dls(64, 224)
learn.fine_tune(5, 1e-3)

epoch,train_loss,valid_loss,accuracy,time
0,0.652872,0.627547,0.786333,11:22


epoch,train_loss,valid_loss,accuracy,time
0,0.52622,0.512269,0.820833,11:23
1,0.49731,0.478509,0.832417,11:24
2,0.423831,0.425962,0.854917,11:23
3,0.342455,0.357712,0.87675,11:23
4,0.319672,0.336151,0.88275,11:23


Accuracy improves from 0.839 to about 0.883, and each epoch on 128-pixel images takes less than half the time of an epoch on 224-pixel images (about 5 minutes versus about 11 minutes).
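
As a rough check on where the speedup comes from, the arithmetic below compares the number of pixels per image at the two sizes with the per-epoch times reported in the tables above (04:57 at 128 pixels, 11:23 at 224 pixels). The helper to_seconds is introduced here only for illustration.

```python
def to_seconds(mmss):
    """Convert an 'MM:SS' string from the training log to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)

small, large = 128, 224

# How many more pixels each image has at the larger size.
pixel_ratio = (large * large) / (small * small)

# How much longer one epoch took at the larger size in this run.
time_ratio = to_seconds("11:23") / to_seconds("04:57")

print(f"pixels per image: {pixel_ratio:.2f}x more at {large}px")
print(f"time per epoch:   {time_ratio:.2f}x longer at {large}px")
```

In this run the epoch time grows more slowly than the pixel count. The batch size also differs between the two stages, so this is an observation about this run rather than a general rule.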

For transfer learning, progressive resizing may actually hurt performance. This is most likely to happen if your pretrained model was quite similar to your transfer learning task and dataset and was trained on similar-sized images, so the weights don't need to be changed much. In that case, training on smaller images may damage the pretrained weights.

###Test Time Augmentation

We could try to apply data augmentation to the validation set. Up until now, we have only applied it on the training set; the validation set always gets the same images. But maybe we could try to make predictions for a few augmented versions of the validation set and average them.

During training we use random cropping as a form of data augmentation, but for the validation set fastai applies a center crop by default. This can be a problem, because objects near the edge of an image may be cropped out entirely. Instead of using only the center crop for validation, we can crop several different areas of the image, pass all of them through the model, and average the predictions. More generally, instead of varying only the crop, we can vary all of the augmentation parameters at test time.

Depending on the dataset, test time augmentation can result in dramatic improvements in accuracy. It does not change the time required to train at all, but will increase the amount of time required for validation or inference by the number of test-time-augmented images requested. By default, fastai will use the unaugmented center crop image plus four randomly augmented images.

In [16]:
preds,targs = learn.tta()
accuracy(preds, targs).item()

0.890999972820282
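
The averaging step can be sketched in plain Python as follows. This is a simplified illustration with made-up probabilities and a hypothetical helper average_predictions; it takes a plain mean over all views, whereas fastai's tta may weight the unaugmented prediction differently.

```python
def average_predictions(per_view_probs):
    """Average class probabilities over several views of the same image."""
    n_views = len(per_view_probs)
    n_classes = len(per_view_probs[0])
    return [sum(view[c] for view in per_view_probs) / n_views
            for c in range(n_classes)]

# Hypothetical probabilities for 3 classes from the center crop and
# two augmented views of one image.
views = [
    [0.50, 0.40, 0.10],  # center crop
    [0.30, 0.60, 0.10],  # augmented view 1
    [0.20, 0.70, 0.10],  # augmented view 2
]

avg = average_predictions(views)
predicted_class = max(range(len(avg)), key=avg.__getitem__)
print(avg, predicted_class)
```

In this made-up example the center crop alone would pick class 0, while the averaged prediction picks class 1, which is how the augmented views can change the final answer.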

###Mixup

Mixup is a powerful data augmentation technique that can lead to higher accuracy, especially when we don't have much data and don't have a pretrained model that was trained on data similar to our dataset. The mixup paper (Zhang et al., "mixup: Beyond Empirical Risk Minimization") explains: "While data augmentation consistently leads to improved generalization, the procedure is dataset-dependent, and thus requires the use of expert knowledge."

Mixup works as follows, for each image:

- Select another image from your dataset at random.
- Pick a weight at random.
- Take a weighted average (using the weight from the previous step) of the selected image with your image; this will be your independent variable.
- Take a weighted average (with the same weight) of this image's labels with your image's labels; this will be your dependent variable.
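
The steps above can be sketched on plain Python lists as follows. This is a minimal illustration, not fastai's MixUp callback: mixup_pair and the tiny "images" are hypothetical, and drawing the weight from a Beta(alpha, alpha) distribution with alpha = 0.4 is an assumption made for the example.

```python
import random

def mixup_pair(x1, y1, x2, y2, lam):
    """Blend two examples: lam * first + (1 - lam) * second."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y

# Two hypothetical flattened "images" with one-hot labels for 3 classes.
img_a, label_a = [0.0, 0.2, 0.4, 0.6], [1.0, 0.0, 0.0]
img_b, label_b = [1.0, 0.8, 0.6, 0.4], [0.0, 0.0, 1.0]

# Assumption: the weight comes from a Beta(alpha, alpha) distribution;
# alpha = 0.4 is an illustrative choice.
rng = random.Random(0)
lam = rng.betavariate(0.4, 0.4)

# The same weight is used for the pixels and for the labels.
mixed_x, mixed_y = mixup_pair(img_a, label_a, img_b, label_b, lam)
print(lam, mixed_x, mixed_y)
```

Because the mixed targets are no longer one-hot, the training loss under mixup is not directly comparable to the validation loss, which is computed on unmixed images.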

In [20]:
model = xresnet50()
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), 
                metrics=accuracy, cbs=MixUp())
learn.fit_one_cycle(5, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.766482,1.70173,0.410917,05:00
1,1.459237,1.416896,0.512,05:00
2,1.249006,0.801754,0.733,05:00
3,1.113351,0.630541,0.79475,05:00
4,1.043892,0.533257,0.831667,05:00


###Cutout data Augmentation

Cutout randomly masks (with a probability of p) square regions of an image with black squares (their number and size chosen between a minimum and a maximum), forcing the network to consider the context rather than learning to recognize features in isolation. The cell below is left commented out and was not run.

In [25]:
# model = xresnet50()
# learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), 
#                 metrics=accuracy, cbs=cutout())
# learn.fit_one_cycle(3, 3e-3)
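
The idea can be sketched in plain Python as follows. This is a simplified illustration rather than fastai's implementation: the function below is hypothetical, works on a 2D list of pixel values, and masks a single square of fixed size instead of a random number of squares.

```python
import random

def cutout(image, size, p=1.0, rng=random):
    """Return a copy of a 2D image with one size x size square set to 0.

    With probability 1 - p the image is returned unchanged.
    """
    out = [row[:] for row in image]
    if rng.random() >= p:
        return out
    height, width = len(out), len(out[0])
    # Pick the top-left corner so the square fits inside the image.
    top = rng.randrange(height - size + 1)
    left = rng.randrange(width - size + 1)
    for r in range(top, top + size):
        for c in range(left, left + size):
            out[r][c] = 0.0
    return out

image = [[1.0] * 6 for _ in range(6)]  # hypothetical 6x6 all-white image
masked = cutout(image, size=2, rng=random.Random(0))
for row in masked:
    print(row)
```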

###Label Smoothing

In most classification datasets, targets are one-hot encoded: the model is trained to return 0 for all categories but one, for which it is trained to return 1. Even 0.999 is not "good enough"; the model will get gradients and learn to predict activations with even higher confidence. This encourages overfitting and gives you, at inference time, a model that does not give meaningful probabilities: it will always say 1 for the predicted category even if it is not too sure, just because it was trained this way. This is especially harmful if the data is not properly labelled.

Instead, we could replace all our 1s with a number a bit less than 1, and our 0s by a number a bit more than 0, and then train. This is called label smoothing. By encouraging your model to be less confident, label smoothing will make your training more robust, even if there is mislabeled data. The result will be a model that generalizes better.
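
A minimal sketch of the smoothed targets, assuming the common formulation in which every class gets eps / N and the true class gets an extra 1 - eps (with eps = 0.1 chosen here for illustration). The helpers smooth_labels and cross_entropy are hypothetical and are not fastai's LabelSmoothingCrossEntropy.

```python
import math

def smooth_labels(target, n_classes, eps=0.1):
    """Smoothed target: eps / N everywhere, plus 1 - eps on the true class."""
    labels = [eps / n_classes] * n_classes
    labels[target] += 1 - eps
    return labels

def cross_entropy(probs, targets):
    """Cross-entropy between predicted probabilities and soft targets."""
    return -sum(t * math.log(p) for p, t in zip(probs, targets))

soft = smooth_labels(target=3, n_classes=10, eps=0.1)
print(soft)

# A near-certain prediction for the true class ...
confident = [0.0001] * 10
confident[3] = 1 - 9 * 0.0001

# ... gets a higher loss than a prediction matching the smoothed targets.
print(cross_entropy(confident, soft), cross_entropy(soft, soft))
```

Under these targets the loss cannot reach zero, so loss values from a label-smoothing run are not directly comparable with those from the plain cross-entropy runs.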

In [19]:
model = xresnet50()
learn = Learner(dls, model, loss_func=LabelSmoothingCrossEntropy(), 
                metrics=accuracy)
learn.fit_one_cycle(5, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,2.272966,2.321267,0.49225,04:59
1,1.916819,2.14839,0.599417,04:59
2,1.664524,1.668528,0.755,04:58
3,1.496097,1.465727,0.832583,04:58
4,1.414968,1.400136,0.859667,04:58


We have applied several techniques used to train state-of-the-art computer vision models to the CIFAR-10 dataset. However, more epochs are needed to see whether these techniques help the model avoid overfitting and reach higher accuracy.