# Preface

In this notebook, we demonstrate a simple form of transfer learning: reusing neural networks that have already been trained on other (similar) datasets or tasks. There are two ways to use such pre-trained networks:
  * as initializers (warm start)
  * as fixed feature extractors

We will also introduce the `tensorflow.keras.applications` interface, which includes many pre-trained models that have been highly successful in their respective application domains. In practice, you should start, whenever possible, from one of these models as a baseline and modify it to build your custom architecture.
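
To get a feel for what the interface offers, the following minimal sketch lists the model constructors exposed by the module. It assumes that constructors are the public callables whose names start with an uppercase letter, which is a naming convention rather than a documented guarantee.

```python
import tensorflow as tf

# Assumption: model constructors (e.g. ResNet50) are the callables whose
# names start with an uppercase letter; lowercase names are submodules.
model_names = sorted(
    name
    for name in dir(tf.keras.applications)
    if name[0].isupper() and callable(getattr(tf.keras.applications, name))
)
print(len(model_names), "constructors, e.g.", model_names[:5])
```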

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
from pathlib import Path
from tqdm.keras import TqdmCallback
sns.set(font_scale=1.5, style='dark')

# CIFAR-10 Dataset

We will use the [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset that we used previously.

![alt text](https://miro.medium.com/max/944/1*6XQqOifwnmplS22zCRRVaw.png "CIFAR10")

In [None]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train = x_train / 255.0
x_test = x_test / 255.0
y_train = tf.keras.utils.to_categorical(y_train)
y_test = tf.keras.utils.to_categorical(y_test)
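
The cell above rescales the pixel values to $[0, 1]$ and one-hot encodes the labels. As a rough illustration of both steps, here is a NumPy-only sketch on made-up values (the arrays `pixels` and `labels` are hypothetical stand-ins, not CIFAR-10 data).

```python
import numpy as np

# Made-up 8-bit pixel intensities, rescaled to floats in [0, 1]
pixels = np.array([0, 51, 255], dtype=np.uint8)
scaled = pixels / 255.0

# CIFAR-10 labels arrive as integers with shape (n, 1)
labels = np.array([[3], [0], [9]])

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(10)[labels.ravel()]
print(scaled)
print(one_hot.shape)
```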

# Using Canned Models

Whenever possible, we should start with canned models that have already been carefully tested and fine-tuned. The `tensorflow.keras.applications` module is a collection of such models, with weights pre-trained on large datasets such as [ImageNet](http://www.image-net.org/).


Here, we will use the [ResNet](https://arxiv.org/abs/1512.03385) architecture, which has been immensely successful at image recognition and related tasks.
![alt text](https://miro.medium.com/max/1524/1*6hF97Upuqg_LdsqWY6n_wg.png)

In [None]:
from tensorflow.keras.applications.resnet import ResNet50
from tensorflow.keras.optimizers import Adam

First, we initialize the ResNet weights randomly by setting `weights=None`. Setting `include_top=True` keeps the classification layers of the ResNet; since we are not loading pre-trained weights, we are free to size them for our 10 classes with `classes=10`.
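
To see what `include_top` controls, the sketch below builds the network with and without the classification layers and compares the output shapes. Both variants use `weights=None` here so that nothing is downloaded; the variable names are illustrative only.

```python
from tensorflow.keras.applications.resnet import ResNet50

# With the top: the network ends in a 10-way classifier
with_top = ResNet50(
    include_top=True, weights=None, input_shape=(32, 32, 3), classes=10)

# Without the top: the network ends in the last convolutional feature map
without_top = ResNet50(
    include_top=False, weights=None, input_shape=(32, 32, 3))

print(with_top.output_shape)     # (None, 10)
print(without_top.output_shape)  # (None, 1, 1, 2048)
```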

In [None]:
baseline = ResNet50(
    include_top=True,
    weights=None,
    input_shape=(32, 32, 3),
    classes=10,
)

In [None]:
baseline.summary()

In [None]:
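def train_and_save(model, path, **kwargs):
    """
    Wrapper around the fit method that saves the trained weights
    and the training history, and reloads them if saved files are found.
    """
    path = Path(path)
    path.mkdir(parents=True, exist_ok=True)
    # Note: recent Keras releases may require weight files to end in `.weights.h5`
    model_path = path.joinpath('model.h5')
    history_path = path.joinpath('history.json')
    if model_path.exists() and history_path.exists():
        # Cached run found: restore weights and history instead of retraining
        model.load_weights(str(model_path))
        history = pd.read_json(str(history_path))
    else:
        model.compile(
            loss='categorical_crossentropy', optimizer=Adam(0.0001), metrics=['accuracy'])
        # All remaining keyword arguments are forwarded unchanged to model.fit
        history = model.fit(**kwargs)
        # Store per-epoch metrics as a DataFrame (one row per epoch)
        history = pd.DataFrame(history.history)
        history.to_json(str(history_path))
        model.save_weights(str(model_path))
    return model, history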
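
The caching in `train_and_save` relies on the training history surviving a round trip through JSON. The following sketch checks this on a made-up history (the numbers and the temporary file location are hypothetical), using the same `to_json` / `read_json` calls as above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up per-epoch metrics in the same layout as `history.history`
fake_history = pd.DataFrame(
    {"accuracy": [0.25, 0.5], "val_accuracy": [0.125, 0.375]})

with tempfile.TemporaryDirectory() as tmp:
    history_path = Path(tmp) / "history.json"
    fake_history.to_json(str(history_path))      # write, as in train_and_save
    restored = pd.read_json(str(history_path))   # read back, as on a cache hit

print(restored)
```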

In [None]:
baseline, baseline_history = train_and_save(
    model=baseline,
    path='resnet_cifar10_from_scratch',
    x=x_train,
    y=y_train,
    validation_data=(x_test, y_test),
    epochs=20,
    batch_size=128,
    verbose=0,
    callbacks=[TqdmCallback(verbose=1)],
)

In [None]:
baseline_history.plot(x=None, y=['accuracy', 'val_accuracy'])

# Using Canned Models with Pre-trained Weights

Now, we use ResNet with pre-trained weights obtained by training on the ImageNet dataset. This is done by simply setting `weights='imagenet'`.


The ImageNet classifier has 1000 output classes, which is not what we need here. Hence, we set `include_top=False` and add our own classification layers.

In [None]:
base_model = ResNet50(
    include_top=False,  # Do not include top
    weights='imagenet',  # load imagenet weights
    input_shape=(32, 32, 3),
)

In [None]:
base_model.summary()

Since this model has no classification layer, we build one for our 10-class problem.
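
As a rough picture of what such a classification head computes, here is a NumPy-only sketch: global average pooling over the spatial axes, followed by a dense softmax layer. The feature array is random and merely mimics the `(batch, 1, 1, 2048)` shape that ResNet50 produces for 32x32 inputs; the weights `W` and `b` are likewise made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the base model output on a batch of 4 images
features = rng.normal(size=(4, 1, 1, 2048))

# Global average pooling: mean over the two spatial axes -> (4, 2048)
pooled = features.mean(axis=(1, 2))

# Dense layer with 10 units followed by a numerically stable softmax
W = rng.normal(scale=0.01, size=(2048, 10))
b = np.zeros(10)
logits = pooled @ W + b
logits = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

print(pooled.shape, probs.shape)
```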

In [None]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense

In [None]:
pretrained_model = Sequential(layers=[base_model])
pretrained_model.add(GlobalAveragePooling2D())
pretrained_model.add(Dense(units=10, activation='softmax'))

In [None]:
pretrained_model.summary()

In [None]:
pretrained_model, pretrained_history = train_and_save(
    model=pretrained_model,
    path='resnet_cifar10_pretrained',
    x=x_train,
    y=y_train,
    validation_data=(x_test, y_test),
    epochs=20,
    batch_size=128,
    verbose=0,
    callbacks=[TqdmCallback(verbose=1)],
)

In [None]:
pretrained_history.plot(x=None, y=['accuracy', 'val_accuracy'])
baseline_history.plot(x=None, y=['accuracy', 'val_accuracy'])

Observe that warm-starting from the pre-trained weights makes training converge much faster than starting from random weights.
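
One simple way to quantify "faster" is to count the epochs needed to reach a given accuracy. The helper below is a hypothetical sketch (it is not part of Keras), and the two accuracy curves are made-up numbers used only to illustrate the idea.

```python
def epochs_to_reach(values, threshold):
    """Return the first (1-based) epoch at which values reach threshold, or None."""
    for epoch, value in enumerate(values, start=1):
        if value >= threshold:
            return epoch
    return None


# Made-up validation accuracies, NOT results from this notebook
made_up_scratch = [0.30, 0.45, 0.55, 0.62, 0.66]
made_up_warm = [0.60, 0.72, 0.78, 0.80, 0.81]

print(epochs_to_reach(made_up_scratch, 0.6))  # 4
print(epochs_to_reach(made_up_warm, 0.6))     # 1
```

On the actual runs, the same helper could be applied to `baseline_history['val_accuracy']` and `pretrained_history['val_accuracy']`.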

# Using Canned Models with Fixed Weights

In the previous example, we allowed all the weights in the combined network to vary during training. Hence, the ImageNet weights were used only as a warm start.

Here, we will explore an alternative, where the `base_model`'s weights are held constant and not trained. This is easily done by setting the attribute
```python
    model.trainable = False
```
The same attribute can also be set on individual layers:
```python
    layer.trainable = False
```
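
As a small illustration of what the flag does, the sketch below freezes a single `Dense` layer and checks how its weights are reported before and after. The layer size and input shape are arbitrary choices for the example.

```python
import tensorflow as tf

layer = tf.keras.layers.Dense(4)
layer.build((None, 3))  # creates the kernel and bias

# Before freezing: kernel and bias are both trainable
n_trainable_before = len(layer.trainable_weights)

layer.trainable = False

# After freezing: the same two variables are reported as non-trainable
n_trainable_after = len(layer.trainable_weights)
n_frozen = len(layer.non_trainable_weights)

print(n_trainable_before, n_trainable_after, n_frozen)  # 2 0 2
```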

In [None]:
base_model = ResNet50(
    include_top=False,  # Do not include top
    weights='imagenet',  # load imagenet weights
    input_shape=(32, 32, 3),
)

We could have set here
```python
    base_model.trainable = False
```
However, this is problematic if the network contains batch normalization layers. Why? Instead, we freeze every layer except the batch normalization layers:

In [None]:
# Freeze every layer except batch normalization (their names contain '_bn')
for layer in base_model.layers:
    if '_bn' not in layer.name:
        layer.trainable = False
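
One likely reason for treating batch normalization separately: in TensorFlow 2.x, a frozen `BatchNormalization` layer runs in inference mode and normalizes with its stored moving statistics rather than those of the current batch. The sketch below illustrates this on made-up data with mean 5; it is an illustration under that assumption, not a statement about what happens inside ResNet50 on CIFAR-10.

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
# Made-up inputs whose statistics differ from the layer's initial moving mean (0)
x = tf.constant(rng.normal(loc=5.0, scale=1.0, size=(256, 4)), dtype=tf.float32)

bn = tf.keras.layers.BatchNormalization()

# Trainable layer in training mode: normalizes with the batch's own statistics
out_trainable = bn(x, training=True)

# Frozen layer: falls back to the moving statistics, even with training=True
bn.trainable = False
out_frozen = bn(x, training=True)

mean_trainable = float(tf.reduce_mean(out_trainable).numpy())
mean_frozen = float(tf.reduce_mean(out_frozen).numpy())
print(mean_trainable, mean_frozen)
```

The first mean should be close to zero, while the second stays far from zero because the stored statistics do not match the data.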

The classification layers are not held constant. Together with the batch normalization layers left trainable above, they are the only layers that are trained.

In [None]:
pretrained_model_v2 = Sequential(layers=[base_model])
pretrained_model_v2.add(GlobalAveragePooling2D())
pretrained_model_v2.add(Dense(units=10, activation='softmax'))

In [None]:
pretrained_model_v2.summary()

In [None]:
pretrained_model_v2, pretrained_v2_history = train_and_save(
    model=pretrained_model_v2,
    path='resnet_cifar10_pretrained_v2',
    x=x_train,
    y=y_train,
    validation_data=(x_test, y_test),
    epochs=20,
    batch_size=128,
    verbose=0,
    callbacks=[TqdmCallback(verbose=1)],
)

In [None]:
pretrained_v2_history.plot(x=None, y=['accuracy', 'val_accuracy'])
baseline_history.plot(x=None, y=['accuracy', 'val_accuracy'])

This time, with most of the `base_model` weights held fixed, we again obtain better results than the baseline. In fact, observe that the generalization gap is much smaller, plausibly because most of the weights have been fixed!
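
The generalization gap mentioned above can be read off a history table as the difference between training and validation accuracy at the final epoch. The helper name `generalization_gap` and the numbers below are made up for illustration.

```python
import pandas as pd


def generalization_gap(history):
    """Training accuracy minus validation accuracy at the last epoch."""
    last = history.iloc[-1]
    return float(last["accuracy"] - last["val_accuracy"])


# Made-up history, NOT results from this notebook
made_up = pd.DataFrame(
    {"accuracy": [0.5, 0.75], "val_accuracy": [0.5, 0.625]})

print(generalization_gap(made_up))  # 0.125
```

Applied to `baseline_history`, `pretrained_history`, and `pretrained_v2_history`, this gives one number per run to compare.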

# Exercise

1. Train the above networks to completion, using various other regularization techniques to get the best possible performance. Compare with our earlier investigations on CIFAR-10.
2. Try transfer learning techniques on other types of data, e.g. RNN-type models for language applications. There are far fewer pre-trained models in this direction, however, so you may have to pre-train your own models to learn across tasks.