# Transfer Learning

> Transfer Learning is the idea of reusing knowledge that a model learned on one task to solve other tasks

This approach allows us to spend less time on:
- coding and coming up with neural network architectures 
- collecting data (as large amounts of data were already used to train these networks)

Furthermore, it helps with:
- generalization (knowledge from a similar domain transfers easily)
- training time (weights are initialized better)

And, maybe even more important, __allows us to use knowledge from datasets which are too large to train on one's machine__

# torchvision

> `torchvision` ([documentation](https://pytorch.org/docs/stable/torchvision/models.html)) provides SOTA (or close to State Of The Art) neural network models for computer vision tasks

Those models were (usually) trained on the well-known `ImageNet` dataset

## ImageNet

[ImageNet](http://image-net.org/) is not only a dataset, but also __a yearly held classification competition__.

Overview of the dataset:
- Over 1 million images
- Images are of different sizes (but usually those are cropped to `224x224` - `384x384`)
- `1000` classes (a lot, this task is hard!)

> __One should keep current best models on ImageNet in mind as those are often used as standalone/part of other models!__

- At the time of writing, EfficientNet-based architectures are SOTA (original research paper [here](https://arxiv.org/abs/1905.11946))
- __Around 90% Top-1 accuracy achieved__ (and 98% Top-5) which means __we are getting closer to solving this dataset as we have "solved" MNIST or CIFAR__
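
To make the Top-1 and Top-5 metrics above concrete, here is a minimal sketch of how Top-k accuracy can be computed. The helper `topk_accuracy` and the toy tensors are hypothetical names introduced only for illustration. A prediction counts as correct when the true class is among the `k` highest-scoring classes.

```python
import torch


def topk_accuracy(logits: torch.Tensor, targets: torch.Tensor, k: int = 1) -> float:
    """Fraction of samples whose true class is among the k highest-scoring classes."""
    # Indices of the k largest scores per sample, shape (batch, k)
    topk_indices = logits.topk(k, dim=1).indices
    # A sample is correct if its target appears anywhere in its top-k row
    correct = (topk_indices == targets.unsqueeze(1)).any(dim=1)
    return correct.float().mean().item()


# Toy example: 3 samples, 4 classes
logits = torch.tensor([
    [0.1, 0.7, 0.2, 0.0],  # best: class 1
    [0.5, 0.1, 0.3, 0.1],  # best: class 0, runner-up: class 2
    [0.1, 0.3, 0.2, 0.4],  # best: class 3, runner-up: class 1
])
targets = torch.tensor([1, 2, 0])

top1 = topk_accuracy(logits, targets, k=1)  # only the first sample is correct
top2 = topk_accuracy(logits, targets, k=2)  # first and second samples are correct
```

Top-5 on ImageNet follows the same logic with `k=5` over `1000` classes, which is why it is always at least as high as Top-1.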

## Using models

> Loading `torchvision` models is simple, use [source code of model](https://pytorch.org/vision/0.8/models.html#torchvision.models.resnet18) to see all available arguments!

In [2]:
import torchvision

model = torchvision.models.squeezenet1_0(pretrained=True)

# Vision models classes

Models provided by `torchvision` (and other sources) can be divided into a few categories (entries are marked with `torchvision` if the package provides them):

## Classification

> Basic task, most of the models were trained on ImageNet (or sometimes pretrained with even larger datasets beforehand). 

The accuracy ranking looks more or less like below (non-comprehensive list grouped by theme, full list [here](https://paperswithcode.com/sota/image-classification-on-imagenet)), sorted from best to worst:

- EfficientNet family - [research](https://arxiv.org/abs/1905.11946) | `EfficientNet-BN`, `EfficientNet-LN` and their variations
- ResNet family - we saw basic idea standing behind it during convolution classes | `torchvision` | `ResNext`, `ResNet`, `Wide ResNe(X)t`
- Inception family | `torchvision` | `InceptionV3`, `Xception`
- MobileNets | `torchvision` | `MobileNetV{1, 2, 3}`, used as building block of EfficientNet
- Older models of historical importance:
    - VGG family | `torchvision` | VGG11, VGG19, large and inefficient in comparison
    - AlexNet | `torchvision` | First neural network winning ImageNet competition
    
> __There are a lot of other interesting ideas presented in ImageNet related papers, read them if you are curious!__

### Which model should I choose?

As always, that depends on your use case, but rough guidelines could be:

- __ResNets__:
    - battle tested
    - work really well in many tasks
    - fast and well optimized in many frameworks (perfect for GPUs)
    - __may not be the most efficient parameter-wise__
    - __go to for initial runs__
- __EfficientNets__:
    - current SOTA
    - may not be as general as ResNet (though research is ever growing)
    - may not be as optimized (ever changing, __potentially faster than ResNets, sometimes much faster__)
    - __more efficient parameter-wise__ (smaller models than ResNets, by a factor on the order of `10`)
    - __test when you want to push your accuracy__
    - __test when you want to deploy to mobile and other constrained devices__ (and you need better results)
- __MobileNets__:
    - really fast (especially on CPU)
    - __battle tested for edge deployment & constrained environments__ (AWS Lambda, Mobile)
    - can be really small (below `1M` parameters) yet good enough
    - __use for mobile, may handle a lot of tasks well enough!__

In [2]:
# Number in ResNet's name tells us how many layers it has
resnet = torchvision.models.resnet18(pretrained=True) # Loading weights trained on ImageNet

# Interesting model between MobileNets and good accuracy
mnasnet = torchvision.models.mnasnet1_0(num_classes=100) # Choosing classes

## Other tasks

- [Semantic Segmentation](https://pytorch.org/vision/stable/models.html#semantic-segmentation)
- [Object Detection & Image Segmentation](https://pytorch.org/vision/stable/models.html#object-detection-instance-segmentation-and-person-keypoint-detection)

![](images/segmentation_vs_detection.png)

[Image Source](https://towardsdatascience.com/a-hitchhikers-guide-to-object-detection-and-instance-segmentation-ac0146fe8e11)

__We will not go into details about those models during this lesson__, but important things to keep in mind:
- Those models use the classification models seen above as a __backbone__ (feature extractor for the specific task), a __recurring theme in vision!__
- Usually trained on large [`COCO` dataset](https://cocodataset.org/)

# PyTorch Hub

> PyTorch provides hub from which one can simply download models ([page](https://pytorch.org/hub/) | [module](https://pytorch.org/docs/stable/hub.html))

It works in a similar fashion to `torchvision` and is currently being developed as __official source of PyTorch models__.

- Anyone can make their models work with PyTorch Hub
- `torchvision` models are available through it
- Other, non vision models are also provided (including NLP, Audio, Generative)

One can easily see the models available in a repository using `torch.hub.list`:

In [None]:
import torch

torch.hub.list(github="intel-isl/MiDaS")

## How to find repositories?

- Official repositories are linked on [PyTorch Hub](https://pytorch.org/hub/) webpage
- Non-official models hosted by users can be found in some repositories (still not a common practice), __look for `hubconf.py` at the root of the GitHub project__ (and see next sections)

## More PyTorch Hub commands

> Watch out, some models are really large!

There are more commands useful for exploration, let's see the cell below:

In [None]:
import tempfile

# This directory will be removed after we leave context manager
with tempfile.TemporaryDirectory() as directory:
    # Where model will be downloaded
    torch.hub.set_dir(directory)

    print(torch.hub.list("pytorch/vision"))

    print(torch.hub.help("pytorch/vision", model="mobilenet_v3_large"))

    model = torch.hub.load(
        "pytorch/vision", model="mobilenet_v3_large", pretrained=True, progress=True
    )

In [None]:
# Finding out more about the downloaded model's methods

methods = dir(model) # available attributes and methods

# Info about specific model's method
help(model.xpu)

# Other sources

What if we can't find a desirable model? There are a few available alternatives:
- [paperswithcode](https://paperswithcode.com/) - outline current SOTA results including only papers with available source code (__quality of implementation not measured!__)
- [arxiv](https://arxiv.org/) - besides the research papers themselves, links to GitHub repositories are __sometimes__ provided, __usually in the abstract__
- GitHub accounts of respected research labs (also includes interesting technical solutions):
    - [Facebook Research](https://github.com/facebookresearch) - General | Vision
    - [DeepMind](https://github.com/deepmind) - General | Reinforcement Learning
    - [Google Research](https://github.com/google-research) - General | Health, Business use
    - [OpenAI](https://github.com/openai/) - General | NLP, large networks
    - [Microsoft Research](https://github.com/MicrosoftResearch) - General | more technical and less DL based
    - [NVIDIA Research](https://github.com/NVlabs) - General | GANs, large scale networks
- [Distill.pub](https://distill.pub/) - research reviews & other publications, sometimes with code

# Model Conversion

> Some models are implemented in different frameworks (usually TensorFlow). We can use `ONNX` to convert them

# ONNX

> [ONNX](https://github.com/onnx/onnx) provides an open source format for AI models, both deep learning and traditional ML

- Transform models into the open exchange format `.onnx`
- Supported by major frameworks/tools

## Downsides

- Not all operations are interchangeable between frameworks
- For SOTA models conversion might be hard
- Puts constraints on some of the frameworks (e.g. PyTorch)

> We will see another way to export models for use in non-Python environments later, during the `torchscript` lesson

> __`ONNX` should be used with care and only for inter-framework conversions.__

> We don't want you to know `ONNX` in and out, just keep this tool in mind when the right time comes!

## Why would I leave my framework?

PyTorch is great, but there are a few cases you might encounter where you need to switch, including:
- Part of the team (or another team) uses a different technology
- PyTorch does not support some form of deployment (which TensorFlow might)
- Hardware specific optimization is required and not possible in PyTorch
- Other parts of the pipeline are implemented in different framework

> The above (and many more) reasons also apply to other deep learning/machine learning frameworks

## PyTorch front end

Let's see how we can export our PyTorch models to `ONNX` format using `torch.onnx` module:

In [28]:
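import pathlib
import tempfile

import torch
import torchvision

model = torchvision.models.mobilenet_v2(pretrained=True)
model.eval() # Set in evaluation mode

# Example input used to trace the model
# (in eval mode BatchNorm uses running statistics, so any batch size works)
example_input = torch.randn(2, 3, 224, 224)
with tempfile.TemporaryDirectory() as data_dir:
    # File is written inside the temporary directory and removed with it
    torch.onnx.export(model, (example_input, ), pathlib.Path(data_dir) / "mobilenet_v2.onnx")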

# Transfer learning

> Transfer learning is the process of reusing model(s) trained on another task and adjusting them to our needs

## Per-domain models

There are some rough guidelines for different tasks:
- Vision:
    - ImageNet models (classification)
    - COCO pretrained models (with pretrained backbones from ImageNet classification)
- NLP:
    - Pretrained word embeddings
    - Large Transformer based architectures (usually BERT and its variations)
    - __Still emerging approach__
    
For other tasks (e.g. reinforcement learning, one shot learning, GANs) transfer learning is not yet so widespread.

> More pretrained models for other domains will probably emerge, as happened first with vision and later with NLP tasks

Aforementioned domain-specific models use pretrained networks from vision (most often) as part of their model though.

## How to finetune?

We will focus on vision and classification tasks, though similar approach is used for NLP.

## Weight freezing

> Weight freezing means freezing __backbone__ (layers creating features) so __those will not learn anything__ and __only enabling last layer to learn on provided data__

### Pros

- __The more you freeze, the faster your neural network will run and less memory it will take!__
- Easier to finetune and "get right"
- We surely will not "destroy" weights learned on other task (which may sometimes occur at the beginning of training due to random initialization of layer)

### Cons

- Representational power is limited (as we cannot change frozen weights)
- We usually will not get best possible result (though we will get it faster)

### Tips

- There is no strict rule, you may unfreeze more parts of the network (though less common)
- You may start with weight freezing, unfreeze afterwards and finish with a small learning rate (or discriminative learning rates), though __this will make the optimization procedure significantly harder__ to implement and reason about
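
Weight freezing as described above can be sketched on a toy network. Here `backbone` and `head` are hypothetical stand-ins for a pretrained feature extractor and a freshly initialized last layer, and the data is random. The point is that only the head changes after an optimizer step.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-ins: a "pretrained" backbone and a new head
backbone = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU())
head = torch.nn.Linear(16, 3)
model = torch.nn.Sequential(backbone, head)

# Freeze the backbone: no gradients are computed for its parameters
for param in backbone.parameters():
    param.requires_grad_(False)

# Hand only the trainable parameters to the optimizer
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1)

backbone_before = backbone[0].weight.detach().clone()
head_before = head.weight.detach().clone()

# One training step on random data
inputs, targets = torch.randn(4, 8), torch.randint(0, 3, (4,))
loss = torch.nn.functional.cross_entropy(model(inputs), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()

backbone_unchanged = torch.equal(backbone[0].weight, backbone_before)
head_changed = not torch.equal(head.weight, head_before)
```

Because frozen parameters never receive gradients, the backward pass stops at the head, which is where the speed and memory savings come from.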

## Discriminative learning rates
    
> Discriminative learning rates mean setting different learning rates for different parts of the neural network

### Pros

- Larger representation space
- Probably better accuracy score
- We won't destroy pretrained weights (as their learning rate is smaller)

### Cons

- __Way longer__ time to train as the whole network is used
- __Harder to finetune__ and "get right"

### Tips

- Divide your neural network into a few regions:
    - head should have the standard learning rate
    - middle of the network should have the head's learning rate divided by `10`
    - first layers (finding general features) should have the middle's learning rate divided by `10` again (so `100` times smaller than the head's)
- `10` is not a strict rule but seems to work well in practice
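
One way to set this up in PyTorch is with optimizer parameter groups, each carrying its own `lr`. The sketch below uses a hypothetical three-part toy network (`early`, `middle`, `head`). It assumes the factor of `10` compounds from region to region, so the first layers end up with a rate `100` times smaller than the head's.

```python
import torch

# Hypothetical three-part network: first layers, middle layers, head
early = torch.nn.Linear(8, 16)
middle = torch.nn.Linear(16, 16)
head = torch.nn.Linear(16, 3)
model = torch.nn.Sequential(early, torch.nn.ReLU(), middle, torch.nn.ReLU(), head)

base_lr = 1e-2

# Each parameter group carries its own learning rate
optimizer = torch.optim.SGD(
    [
        {"params": head.parameters(), "lr": base_lr},
        {"params": middle.parameters(), "lr": base_lr / 10},
        {"params": early.parameters(), "lr": base_lr / 100},
    ],
    lr=base_lr,  # default for groups that do not set their own value
    momentum=0.9,
)

learning_rates = [group["lr"] for group in optimizer.param_groups]
```

Learning rate schedulers operate on every group, so the ratio between regions is preserved while the rates decay.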

# Data Augmentation

> __Data augmentation means changing data sample in some way__

We usually do that in order to:
- __improve generalization__:
    - model sees more samples (sample after augmentation may be totally different from the original one)
    - model sees different variants of samples
- __enhance size of the dataset__

Things to keep in mind:
- Augmentations are performed __ON THE FLY__, which means:
    - load image from disk (via `torch.utils.data.Dataset` instance)
    - perform a transformation
    - __Never perform transformations before training (e.g. by saving images)__ as it inflates disk usage without necessity
    - __If you want some speedups you can perform caching of some transformations__ (or cache part of the images in memory)
- __We have to PRESERVE LABELS__ - for segmentation or object detection the target (mask or bounding boxes) __must be transformed in exactly the same way__ as the input image
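
The two points above (augmenting on the fly and preserving labels) can be sketched with a small dataset. `FlippedSegmentationDataset` is a hypothetical class written for illustration, and in-memory tensors stand in for images loaded from disk. The random decision is drawn once per sample and applied to both the image and its mask.

```python
import random

import torch
from torch.utils.data import Dataset


class FlippedSegmentationDataset(Dataset):
    """Hypothetical dataset applying a random horizontal flip on the fly."""

    def __init__(self, images: torch.Tensor, masks: torch.Tensor, p: float = 0.5):
        self.images = images  # (N, C, H, W), stands in for images loaded from disk
        self.masks = masks  # (N, H, W)
        self.p = p  # probability of flipping a sample

    def __len__(self) -> int:
        return len(self.images)

    def __getitem__(self, index: int):
        image, mask = self.images[index], self.masks[index]
        # Draw the random decision once and apply it to BOTH image and mask
        if random.random() < self.p:
            image = image.flip(-1)  # flip along the width dimension
            mask = mask.flip(-1)
        return image, mask


images = torch.arange(12, dtype=torch.float32).reshape(2, 1, 2, 3)
masks = (images[:, 0] % 3 == 0).long()  # toy rule: mark pixels divisible by 3
dataset = FlippedSegmentationDataset(images, masks, p=1.0)  # always flip, for illustration
image, mask = dataset[0]
```

After the flip the mask still marks exactly the pixels it marked before, which is what "preserving labels" means for dense targets.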

## torchvision augmentations

> `torchvision` provides a few well-known augmentations __and this is a good starting point for your models__

All of them are provided in [`torchvision.transforms`](https://pytorch.org/vision/stable/transforms.html) submodule.

Keep in mind:
- `transforms` are done (usually) on a per-element basis
- `transforms` are (usually) passed to `torch.utils.data.Dataset` instances

Let's see how to create two augmentations via `torchvision`:

In [None]:
from torchvision import transforms

data_transformation = transforms.Compose([
     transforms.RandomCrop((224, 224)), # Randomly crop to this image size
     transforms.ToTensor(),
])

# Pass data_transformation to your torch.utils.data.Dataset

## Example augmentations

Below augmentations are:
- battle-tested
- work well for a lot of tasks (__given you preserve labels correctly!__)

### [RandomCrop](https://pytorch.org/vision/stable/transforms.html#torchvision.transforms.RandomCrop)

> __Take a random crop of the image with specified size__

This augmentation allows us to:
- Prevent model from overfitting as it sees different parts of the picture
- Increase number of samples
- __Hard to go wrong, BUT you shouldn't make the images much smaller__ (as they might lose too much information)

![](images/random_crop.jpg)

### [Random{Horizontal, Vertical}Flip](https://pytorch.org/vision/stable/transforms.html#torchvision.transforms.RandomHorizontalFlip)

> __Make a mirror-like flip of the image__

Things to note:
- __Very hard to go wrong__ (unless you are doing object detection)
- __HORIZONTAL flip is used more often than VERTICAL__ (because objects in nature usually keep their meaning when mirrored left to right)
- __Don't go too crazy with VERTICAL flip__ - the horizontal one is probably a better choice

![](images/random_horizontal_flip.png)

### [RandomRotation](https://pytorch.org/vision/stable/transforms.html#torchvision.transforms.RandomRotation)

> __Randomly rotate image within `(-degree, degree)` range__

Things to note:
- __Used quite often__
- __May not work really well__ due to necessary zeros and/or zooming in in order to fill left-out space
- __Might not be the best first choice__

### [CutOut](https://pytorch.org/vision/stable/transforms.html#torchvision.transforms.RandomErasing)

> __Simple technique where you zero-out part of the image__

Relatively new augmentation for neural networks, __but seems to work really well__ due to:
- improving generalization (as neural network does not see random parts of the image)
- similarity to dropout, yet __not destroying `BatchNorm` layers__
- pretty simple and quick

![](images/cutout.png)
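
A minimal sketch of the idea, assuming a simplified variant where a single square lies fully inside the image. The `cutout` function below is hypothetical and written for illustration. `torchvision`'s linked `RandomErasing` transform offers a configurable version.

```python
from typing import Optional

import torch


def cutout(
    image: torch.Tensor, size: int, generator: Optional[torch.Generator] = None
) -> torch.Tensor:
    """Return a copy of `image` (C, H, W) with one random size x size square zeroed."""
    _, height, width = image.shape
    # Top-left corner chosen so that the square lies fully inside the image
    top = torch.randint(0, height - size + 1, (1,), generator=generator).item()
    left = torch.randint(0, width - size + 1, (1,), generator=generator).item()
    result = image.clone()  # leave the input untouched
    result[:, top : top + size, left : left + size] = 0.0
    return result


image = torch.ones(3, 8, 8)
augmented = cutout(image, size=4, generator=torch.Generator().manual_seed(0))
```

The same region is zeroed in every channel, so the network loses a whole patch of the picture rather than individual color values.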

### MixUp

> __Mix two images together and mix their respective labels__

Deemed a blasphemy, __yet very effective for CNNs__, because:
- Smoother decision boundaries
- Learns what a "cat-dog" like creature could look like
- Conflicting feedback
- __Soft targets__

![](images/mixup.png)
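
Written out, the mixing step combines two samples $(x_i, y_i)$ and $(x_j, y_j)$ with a coefficient $\lambda \in [0, 1]$ (the symbols are introduced here for illustration, with labels assumed to be one-hot encoded):

$$
\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j
$$

For example, with $\lambda = 0.7$, a cat image labeled $[1, 0]$ mixed with a dog image labeled $[0, 1]$ gives the soft target $[0.7, 0.3]$. The original MixUp paper samples $\lambda$ from a Beta distribution for every batch, while a fixed value is a simpler starting point.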

## Other augmentation libraries

- __[AugLy](https://github.com/facebookresearch/AugLy)__:
    - Facebook backed (first class PyTorch support)
    - __Non-image augmentations__ (audio, text etc.)
    - __Wide range of available augmentations__
- __[Albumentations](https://github.com/albumentations-team/albumentations)__:
    - __ONLY IMAGES__
    - __Library-agnostic__ (based on `np.ndarray` transformations)
    - __Needs some additional efforts to make it work with PyTorch__
    - __Wide range of available augmentations__ (more experimental than `torchvision`)

## Tips

- __Try not to do too many of them__ - you may lose original information
- Use the most standard ones (the ones mentioned in this notebook work reasonably well)
- __Start simple__ - rotations, flips, maybe cutout, move on to more sophisticated methods if needed
- __Tailor to your data__, e.g. if the class label does not depend on color, try the `channel-shuffle` augmentation
- __Be creative__ - correct data augmentation can give you a few additional accuracy points!

# Exercise

> __Get best score in `5` minutes of training from scratch using pretrained model!__

- Implement a `freeze` function taking in a neural network and calling `requires_grad_(False)` on each parameter
- Do the same for `unfreeze` but call `requires_grad_(True)` on each parameter
- Load any model you want from `torchvision` (or maybe some other resource?):
    - The larger the better, but may not fit on the GPU
    - Use knowledge from the beginning when choosing it
- Print the model to get a little info about its structure (backbone, bottleneck etc.)

How can I make my model perform better (or run faster)?

- Use one of two freezing modes (__or mix them by freezing only a few layers!__)
- Use data augmentations
- Use schedulers and optimizers
- Use our training system in order to train your models

__Additional__:

- Create a `MixUp` transform which:
    - Gets a batch of data in form `(X, y)` (features, labels)
    - Makes a random permutation of features and mixes these together with `lambda` hyperparameter (default equal to `0.5`)
    - Makes the same permutation on labels and mixes them together with `lambda` hyperparameter (default equal to `0.5`)
    
Use it for the task and see whether there's any improvement!

In [29]:
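def freeze(module: torch.nn.Module):
    # Evaluation mode: BatchNorm uses (and stops updating) running statistics, Dropout is disabled
    module.eval()
    for param in module.parameters():
        param.requires_grad_(False)  # exclude parameter from gradient computation

def unfreeze(module: torch.nn.Module):
    # Back to training mode so BatchNorm and Dropout behave as during training
    module.train()
    for param in module.parameters():
        param.requires_grad_(True)  # parameter receives gradients again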

In [None]:
# Your code

## Challenges

### Assessment

- What is "knowledge distillation"? Where is it used and what are the reasons?
- What is "quantization"? Why is it useful? When should we use it? Read about it in [PyTorch documentation](https://pytorch.org/docs/stable/quantization.html)
- What is "auto-augmentation"? Check `torchvision`'s version [here](https://pytorch.org/vision/stable/transforms.html#autoaugment-transforms)

### Non-assessment

- What is [CutMix](https://arxiv.org/abs/1905.04899) regularization strategy?
- Read about necessary steps to publish your models to PyTorch Hub [here](https://pytorch.org/hub/)
- What are [Adapters](https://arxiv.org/pdf/1902.00751.pdf)? 