<a href="https://colab.research.google.com/github/Aayush360/Fast_AI/blob/master/training_a_state_of_the_art_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
#hide
!pip install fastai --upgrade
!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()

Mounted at /content/gdrive


In [2]:
#hide
from fastbook import *
from fastai.vision.widgets import *

In [3]:
# download and extract the data

path = untar_data(URLs.IMAGENETTE)

In [4]:
path.ls()

(#2) [Path('/root/.fastai/data/imagenette2/val'),Path('/root/.fastai/data/imagenette2/train')]

In [5]:
parent_label(path)

'data'

### Baseline Model

In [6]:
# get our datsets into dataloaders objects and using presizing

dblock = DataBlock(blocks=(ImageBlock,CategoryBlock),
                   get_items= get_image_files,
                   get_y = parent_label,
                   item_tfms = Resize(460),
                   batch_tfms = aug_transforms(size=224, min_scale=0.75))

dls = dblock.dataloaders(path,bs=64)

In [8]:
# let's do a training that will serve as a baseline

model = xresnet50()
learn = Learner(dls, model, loss=CrossEntropyLossFlat(), metrics=accuracy)
learn.fit_one_cycle(5,3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.60859,2.537468,0.356236,02:55
1,1.205611,1.850787,0.512323,02:56
2,0.965782,0.984648,0.689694,02:56
3,0.728013,0.732605,0.773712,02:56
4,0.599703,0.5856,0.815907,02:56


### Normalization

(to make mean 0 and std. dev of 1)

when working with model that are being trained form scratch, or fine-tuned to a very different dataset from the one used for pretraining, some additional techniques are important





In [9]:
# let's use standard imagenet mean and std. dev provided by fastai

In [7]:
def get_dls(bs, size):
  dblock = DataBlock(blocks=(ImageBlock, CategoryBlock),
                     get_items = get_image_files,
                     get_y = parent_label,
                     item_tfms = Resize(460),
                     batch_tfms = [*aug_transforms(size=size, min_scale=0.75),
                                   Normalize.from_stats(*imagenet_stats)])
  return dblock.dataloaders(path,bs=bs)

In [15]:
dls = get_dls(64, 224)

In [16]:
x,y = dls.one_batch()

In [17]:
x.mean(dim=[0,2,3]), x.std(dim=[0,2,3])

(TensorImage([-0.0313,  0.0090,  0.0224], device='cuda:0'),
 TensorImage([1.2185, 1.2251, 1.3035], device='cuda:0'))

In [18]:
# let's see what effect it had on our training

In [19]:
model = xresnet50()
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.fit_one_cycle(5,3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.648656,6.249081,0.320762,02:56
1,1.229609,1.618819,0.549664,02:56
2,0.974689,2.048304,0.521658,02:55
3,0.750263,0.698057,0.779313,02:55
4,0.599795,0.543447,0.82599,02:56


### Progressive Reisizing

start training using smaller images and end the training using larger images
- using smaller images helps training complete faster
- using larger image during the completion, makes the final accuracy much higher

- it is a data augmentation techniques, thus the model that are trained with progressive resizing generalizes well 

In [8]:
dls = get_dls(128,128)
learn = Learner(dls, xresnet50(),loss_func= CrossEntropyLossFlat(),metrics=accuracy)
learn.fit_one_cycle(4,3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.875263,3.431871,0.375653,03:01
1,1.311638,1.380955,0.575803,03:02
2,0.990952,0.905073,0.734503,03:00
3,0.754376,0.714095,0.78118,03:01


In [10]:
# now fine_tune the model using the increased size 

In [9]:
dls = get_dls(64,224)
learn.fine_tune(5,1e-3)

epoch,train_loss,valid_loss,accuracy,time
0,0.741726,1.092988,0.648992,03:01


epoch,train_loss,valid_loss,accuracy,time
0,0.627448,0.718774,0.776699,02:58
1,0.631106,0.717662,0.78118,03:00
2,0.57695,0.631951,0.807692,02:59
3,0.514913,0.576335,0.823002,02:59
4,0.466125,0.552052,0.830471,02:59


### Test Time Augmentation (TTA)

instead of cropping from center portion of the image from the validation set( which can miss out important details for that image) we use multiple crops from different locations (provided), pass them through the model and take the maximum or average of the predictions.
We can do this not just for cropping but for different values across all our test time augmentation parameters.

- this increase the amount of time required for validation or inference by the number of test-time augmentede images requested

In [16]:
preds, targs = learn.tta()
accuracy(preds,targs).item()

0.8278566002845764

### Mixup

constructs virtual training example
- takes weighted average of selected image and random image, and also weighted average of selected image's label and random image's label.
- weight is also selected as random
- we need labels of the dataset to be one-hot encoded

In [11]:
# we use mixup callback, callbacks are used to inject custom behaviour in the training loop

In [13]:
model = xresnet50()
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy, cbs=MixUp)
learn.fit_one_cycle(5,3e-3)


epoch,train_loss,valid_loss,accuracy,time
0,1.943651,2.493498,0.351755,06:05
1,1.665869,1.574992,0.486931,06:04
2,1.412369,1.166129,0.625093,06:05
3,1.241076,0.770172,0.76363,06:04
4,1.15327,0.640036,0.807692,06:04


In [14]:
# harder to train because it's harder to see what's in each image, also has to predict two labels per each image
# requires far more epochs to train to get better accuracy compared to other augmentation approches
# problem with mixup- changing the labels also changes the data augmentation - to handle this issue we use label smoothing

### Label smoothing

- replace all 1's with number a bit less than 1, and 0's by a number a bit greater than 0
- makes training more robust even if there is mislabeled data, results in a model that generalizes better at inference
- prevent model from predicting something overconfidently

- replace 0 with 0+(e/N) and 1 by 1-e+(e/N) , e is epsilon (0.1 usually, meaning we are 10% unsure that the labels is 0) and N is the number of classes, they should sum up to 1, 0+(e/N)+1-e+(e/N)=1

In [15]:
model = xresnet50()
learn = Learner(dls, model,loss_func=LabelSmoothingCrossEntropy(),
                metrics=accuracy)
learn.fit_one_cycle(5,3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,2.512515,2.829167,0.457431,06:03
1,2.152401,2.159655,0.62584,06:02
2,1.923842,2.129009,0.649365,06:02
3,1.719157,1.62871,0.795743,06:02
4,1.602434,1.573407,0.822255,06:02


In [None]:
# like with mixup, with label smoothing no general improvements until you train for longer epochs

#### when we want to prototype quick experiments on a new datasets - make a small subset of datset that is the representative of the entire datset