
In [1]:
!pip install fastai --upgrade


#Training a State-of-the-Art Model

Introduces:
- normalization
- progressive resizing
- test time augmentation
- Mixup
- label smoothing



In [2]:
from fastai.vision.all import *

In [3]:
path = untar_data(URLs.IMAGENETTE)

In [4]:
dblock = DataBlock(
    blocks=(ImageBlock(), CategoryBlock()),
    get_items=get_image_files,
    get_y=parent_label,
    item_tfms=Resize(460),
    batch_tfms=aug_transforms(size=224, min_scale=0.75)
)
dls = dblock.dataloaders(path, bs=64)

In [5]:
# Baseline: xresnet50 trained from scratch (no pretrained weights), data not yet normalized
model = xresnet50()
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.fit_one_cycle(5, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.582231,2.929521,0.362584,05:20
1,1.211608,3.735433,0.323749,05:10
2,0.952245,1.588585,0.57655,05:11
3,0.722617,0.659636,0.796863,05:10
4,0.589602,0.552289,0.834205,05:10


In [6]:
# Per-channel mean and std of one batch, before any normalization is applied
x,y = dls.one_batch()
x.mean(dim=[0,2,3]), x.std(dim=[0,2,3])

(TensorImage([0.4834, 0.4915, 0.4598], device='cuda:0'),
 TensorImage([0.2723, 0.2677, 0.3010], device='cuda:0'))

In [7]:
def get_dls(bs, size):
  dblock = DataBlock(
      blocks=(ImageBlock(), CategoryBlock()),
      get_items=get_image_files,
      get_y=parent_label,
      item_tfms=Resize(460),
      batch_tfms=[*aug_transforms(size=size, min_scale=0.75),
                  Normalize.from_stats(*imagenet_stats)]
  )
  return dblock.dataloaders(path, bs=bs)

dls = get_dls(64, 224)

x,y = dls.one_batch()
x.mean(dim=[0,2,3]), x.std(dim=[0,2,3])

(TensorImage([-0.1730, -0.0487,  0.0951], device='cuda:0'),
 TensorImage([1.2039, 1.2084, 1.3074], device='cuda:0'))
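
As a rough illustration of what the `Normalize.from_stats(*imagenet_stats)` step does to each channel, the sketch below subtracts a per-channel mean and divides by a per-channel standard deviation. It uses plain Python and a made-up two-pixel image instead of fastai tensors; `normalize_pixels` and `toy_image` are hypothetical names, and the constants are the commonly quoted ImageNet statistics.

```python
# Hypothetical helper illustrating per-channel normalization: (x - mean) / std.
# Assumption: an "image" is a list of pixels, each pixel an (r, g, b) tuple in [0, 1].

IMAGENET_MEAN = (0.485, 0.456, 0.406)  # commonly quoted ImageNet channel means
IMAGENET_STD = (0.229, 0.224, 0.225)   # commonly quoted ImageNet channel stds


def normalize_pixels(pixels, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Shift each channel by its mean and scale it by its std."""
    return [
        tuple((value - m) / s for value, m, s in zip(pixel, mean, std))
        for pixel in pixels
    ]


# First pixel sits exactly at the channel means, second is one std above them.
toy_image = [(0.485, 0.456, 0.406), (0.714, 0.680, 0.631)]
normalized = normalize_pixels(toy_image)
```

Because these statistics describe ImageNet as a whole rather than this particular augmented batch, a normalized batch generally lands near, rather than exactly at, mean 0 and standard deviation 1, as in the output above.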

In [9]:
# Same baseline (xresnet50 trained from scratch), now with normalized data
model = xresnet50()
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.fit_one_cycle(5, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.653275,1.752058,0.479089,05:09
1,1.251967,1.276721,0.632188,05:09
2,0.937295,1.124018,0.654593,05:10
3,0.731189,0.691861,0.789395,05:11
4,0.576136,0.538391,0.833831,05:10


#Progressive resizing
Gradually use larger and larger images as training progresses.

Start with small images because training on them is faster, then finish with larger images to push the final accuracy higher.

In transfer learning, progressive resizing can hurt performance if the data and task are similar to those the pretrained model was trained on. If the task and data are different, it will probably improve performance.


In [10]:
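# Progressive resizing, stage 1: train on small 128px images with a larger batch size.
# Note: `model` is the xresnet50 already trained in the previous cell, not a fresh network.
dls = get_dls(128,128)
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.fit_one_cycle(4, 3e-3)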

epoch,train_loss,valid_loss,accuracy,time
0,0.772921,0.93358,0.707618,03:07
1,0.771989,1.02146,0.678118,02:58
2,0.626747,0.567813,0.820015,02:57
3,0.494075,0.468738,0.852502,02:59


In [11]:
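# Stage 2: swap in full-size 224px images and keep training the same model.
# fine_tune runs one initial epoch (freeze_epochs=1 by default), then 5 more epochs, hence two tables.
learn.dls = get_dls(64, 224)
learn.fine_tune(5,1e-3)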

epoch,train_loss,valid_loss,accuracy,time
0,0.5452,0.889536,0.73301,05:12


epoch,train_loss,valid_loss,accuracy,time
0,0.449443,0.437027,0.862957,05:11
1,0.44379,0.405572,0.867065,05:10
2,0.413028,0.400557,0.870052,05:10
3,0.345883,0.328906,0.89283,05:10
4,0.3045,0.321251,0.896191,05:10


#Test time augmentation
Until now we have used random cropping as data augmentation during training, which leads to better generalization. At validation time a center crop is used by default, and this can be problematic, especially in multi-label datasets: important parts of the image near the edges might be cropped out. A possible solution is to squash or stretch the images instead, but this makes it harder to learn shapes and we miss out on a useful image augmentation.

TTA: 
Select a number of areas to crop from the original image, pass each of them through the model, and take the max or average of the predictions. The same can be done for different values across all TTA parameters. This will not increase training time, but it will increase validation time.
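
The combining step can be sketched in plain Python as below. This is only an illustration under stated assumptions: `tta_predict` and `fake_predict` are hypothetical names, the "views" are just labels standing in for crops, and the probabilities are invented. `learn.tta()` takes care of the augmentation and batching itself.

```python
# Sketch of test time augmentation: combine the class probabilities predicted
# for several views (crops) of the same image by averaging or taking the max.


def tta_predict(views, predict, use_max=False):
    """Combine per-view class probabilities by mean (default) or element-wise max."""
    per_view = [predict(view) for view in views]
    n_classes = len(per_view[0])
    if use_max:
        return [max(p[c] for p in per_view) for c in range(n_classes)]
    return [sum(p[c] for p in per_view) / len(per_view) for c in range(n_classes)]


def fake_predict(view):
    # Hypothetical stand-in for a model: fixed, invented probabilities per crop.
    table = {
        "center": [0.6, 0.4],
        "left": [0.2, 0.8],
        "right": [0.1, 0.9],
    }
    return table[view]


views = ["center", "left", "right"]
avg_probs = tta_predict(views, fake_predict)
max_probs = tta_predict(views, fake_predict, use_max=True)
predicted_class = avg_probs.index(max(avg_probs))
```

In this toy case the center crop alone would favor class 0, while the average over all three views favors class 1, which is the kind of correction TTA is meant to provide when the decisive part of the image lies outside the center.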

In [12]:
preds, targs = learn.tta()


In [13]:
accuracy(preds, targs).item()

0.9006721377372742

#Mixup
Can improve performance when you:
- don't have much data
- don't have a pretrained model that was trained on data similar to yours

Data augmentation is important for generalization. How much is appropriate depends on your data, and tuning the amount of augmentation can lead to better results.

Mixup works in the following way, for each image:
1. Select another image from the dataset at random.
2. Pick a random weight.
3. Take the weighted average of the two images; this is the independent variable.
4. Take the weighted average of the two images' targets, using the same weight; this is the dependent variable.
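
The four steps above can be sketched for a single pair of examples as follows. This is a minimal illustration, not fastai's `MixUp` callback, which works on whole batches and may differ in detail. `mixup_pair` is a hypothetical helper, images are flat lists of floats, targets are one-hot lists, and drawing the weight from a Beta distribution with `alpha=0.4` is an assumption rather than something stated in these notes.

```python
import random

# Sketch of mixup for one pair of (image, target) examples.


def mixup_pair(img_a, target_a, img_b, target_b, weight):
    """Blend two images and their one-hot targets with the same weight."""
    mixed_img = [weight * a + (1 - weight) * b for a, b in zip(img_a, img_b)]
    mixed_target = [weight * a + (1 - weight) * b for a, b in zip(target_a, target_b)]
    return mixed_img, mixed_target


rng = random.Random(0)
weight = rng.betavariate(0.4, 0.4)  # step 2: random weight (alpha=0.4 is assumed)

img_a, target_a = [1.0, 1.0, 1.0, 1.0], [1.0, 0.0]  # toy image of class 0
img_b, target_b = [0.0, 0.0, 0.0, 0.0], [0.0, 1.0]  # step 1: the "other" image, class 1

# Steps 3 and 4: the same weight blends both the pixels and the targets.
mixed_img, mixed_target = mixup_pair(img_a, target_a, img_b, target_b, weight)
```

With a weight of 0.3, for example, the mixed target would be `[0.3, 0.7]`: the label is no longer a hard 0 or 1 but reflects how much of each image went into the blend.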

 


In [15]:
model = xresnet50()
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(),
                metrics=accuracy, cbs=MixUp)  # cbs: callbacks applied during training
learn.fit_one_cycle(5, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,2.195424,2.042561,0.396565,02:49
1,1.733193,1.385232,0.571322,02:47
2,1.49059,1.028455,0.660568,02:45
3,1.326894,0.791496,0.760269,02:44
4,1.203354,0.699952,0.796117,02:45


When we use mixup, the model:
- will be harder to train, and thus needs _many_ more epochs.
- will be less likely to overfit, because the images are different in every epoch.

Mixup is not exclusive to image recognition. It has also been applied to NLP and other types of data.

Mixup also addresses the issue that the loss can never be perfect. The labels are either 0 or 1 (in binary classification), but the output of a sigmoid lies in the open interval (0,1) and never reaches either value. As training continues, the model is therefore pushed toward more extreme predictions with every epoch. With mixup the labels also lie in (0,1), unless the two mixed images happen to belong to the same class.

#Label smoothing
In classification, the labels for the training data are usually one-hot encoded, meaning each entry is either 0 or 1. Because of this, the model is trained to predict exactly 0 or 1. Even 0.999 isn't good enough, and the optimization algorithm will keep trying to improve the prediction. This can be harmful and lead to overfitting, especially when some of the data is mislabeled.

Instead of labels that are exactly 0 or 1, we can use labels that are slightly higher than 0 and slightly lower than 1. This makes training more robust, even when there is mislabeled data, and tends to give a model that generalizes better at inference.

With label smoothing, the 0s are replaced by $\epsilon / N$, where $N$ is the number of classes and $\epsilon$ is a small smoothing parameter. The 1s are replaced by $1 - \epsilon + (\epsilon / N)$, so the smoothed labels still sum to 1.
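
As a worked example of these two formulas, the sketch below builds smoothed targets for one example and compares the cross-entropy of a very confident prediction with that of a prediction that matches the smoothed targets. `smooth_labels` and `cross_entropy` are hypothetical helpers written for illustration, not fastai functions, and `eps=0.1` with 10 classes is an assumed setting.

```python
import math

# Sketch of label smoothing for a single example.


def smooth_labels(true_class, n_classes, eps=0.1):
    """Replace 0 with eps/N and 1 with 1 - eps + eps/N."""
    off_value = eps / n_classes
    on_value = 1 - eps + eps / n_classes
    return [on_value if c == true_class else off_value for c in range(n_classes)]


def cross_entropy(probs, targets):
    """Cross-entropy between predicted probabilities and (possibly soft) targets."""
    return -sum(t * math.log(p) for p, t in zip(probs, targets))


# With N=10 and eps=0.1: the true class becomes 0.91, every other class 0.01.
targets = smooth_labels(true_class=2, n_classes=10, eps=0.1)

# A near-certain prediction: 0.9991 on the true class, 0.0001 elsewhere.
confident = [0.0001] * 10
confident[2] = 1 - 9 * 0.0001

# A prediction that reproduces the smoothed targets exactly.
matched = list(targets)

loss_confident = cross_entropy(confident, targets)
loss_matched = cross_entropy(matched, targets)
```

Against smoothed targets, the near-certain prediction scores a higher loss than the prediction that simply matches the targets, so the optimizer is no longer rewarded for pushing outputs ever closer to 0 and 1.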

When you use label smoothing, your model is less likely to develop large weights and activations. Large weights can lead to bumpy functions, where a small change in the input leads to a big change in the output.

When the labels are 0 or 1 and the gradients are bounded between -1 and 1, transfer learning becomes harder. This is because the loss from incorrect predictions is unbounded, but we can only take a limited step each time.

In [16]:
model = xresnet50()
learn = Learner(dls, model, loss_func=LabelSmoothingCrossEntropy(),
                metrics=accuracy)
learn.fit_one_cycle(5, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,2.767864,3.278496,0.416729,02:48
1,2.254204,2.394877,0.515683,02:47
2,1.966359,1.917373,0.681105,02:48
3,1.776351,1.812582,0.744586,02:45
4,1.636841,1.582906,0.810306,02:45
