#### Transfer Learning
It is generally not a good idea to train a very large DNN from scratch without first trying to find an existing neural network that accomplishes a similar task to the one you are trying to tackle. If you find such as neural network, then you can generally reuse most of its layers, except for the top ones. This technique is called *`transfer learning`*.

There are two common ways to do transfer learning:

1. **Feature extraction** – freeze all the pretrained layers and only train a new classifier (the very top layers) on your data.  

2. **Fine-tuning** – unfreeze some (or all) of the pretrained layers and continue training them with a very low learning rate so the network gently adapts to your new task.

In [1]:
# Importing Libraries
import numpy as np
from tensorflow import keras
from keras.datasets import fashion_mnist
from keras import Sequential
from keras.layers import Input, Dense, Flatten

In [2]:
# Load the Fashion MNIST dataset
(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()

In [3]:
# Building the model
model = Sequential([
    Input(shape = x_train.shape[1:]),
    Flatten(),
    Dense(units = 128, activation = 'relu'),
    Dense(units = 64, activation = 'relu'),
    Dense(units = 32, activation = 'relu'),
    Dense(units = 16, activation = 'relu'),
    Dense(units = 1, activation = 'sigmoid'),
])

model.compile(optimizer = keras.optimizers.Adam(learning_rate=0.001), loss = 'binary_crossentropy', metrics = ['accuracy'])
model.summary()

*A ready-made already trained binary classifier model for use in our task.*

In [4]:
history = model.fit(x_train, np.where(y_train == 1, 1, 0), epochs = 10, validation_data = (x_test, np.where(y_test == 1, 1, 0)))

Epoch 1/10
[1m1875/1875[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 3ms/step - accuracy: 0.9895 - loss: 0.0678 - val_accuracy: 0.9925 - val_loss: 0.0279
Epoch 2/10
[1m1875/1875[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 3ms/step - accuracy: 0.9933 - loss: 0.0289 - val_accuracy: 0.9908 - val_loss: 0.0392
Epoch 3/10
[1m1875/1875[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 3ms/step - accuracy: 0.9940 - loss: 0.0241 - val_accuracy: 0.9912 - val_loss: 0.0325
Epoch 4/10
[1m1875/1875[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 4ms/step - accuracy: 0.9950 - loss: 0.0210 - val_accuracy: 0.9928 - val_loss: 0.0367
Epoch 5/10
[1m1875/1875[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m8s[0m 4ms/step - accuracy: 0.9957 - loss: 0.0175 - val_accuracy: 0.9948 - val_loss: 0.0188
Epoch 6/10
[1m1875/1875[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 4ms/step - accuracy: 0.9952 - loss: 0.0221 - val_accuracy: 0.9943 - val_loss: 0.0250
Epoch 7/10
[1m1

In [5]:
# Saving the model
model.save(filepath = "model_1.keras")

*Leveraging the pre-trained model for our specific task*

In [None]:
# Loading the same model
model = keras.models.load_model("model_1.keras", compile = False)

"""
    Don't do this
        freezed_model = keras.models.Sequential(model.layers[:-1])
        freezed_model.add(keras.layers.Dense(units = 10, activation = "softmax"))
"""
# model and freezed_model now share some layers. When you train freezed_model, it will also affect model. If you want to avoid that, you need to clone model before you reuse its layers.

# Cloning the model
cloned_model = keras.models.clone_model(model) # tf.keras.models.clone_model() only clones the architecture, not the weights. If you don’t copy them manually using set_weights(), they will be initialized randomly when the cloned model is first used.

# Freezed model
freezed_model = keras.models.Sequential(cloned_model.layers[:-1])
freezed_model.add(keras.layers.Dense(units = 10, activation = "softmax"))

In [8]:
freezed_model.layers

[<Flatten name=flatten, built=True>,
 <Dense name=dense, built=True>,
 <Dense name=dense_1, built=True>,
 <Dense name=dense_2, built=True>,
 <Dense name=dense_3, built=True>,
 <Dense name=dense_5, built=False>]

*If the input pictures for your new task don’t have the same size as the ones used in the original task, you will usually have to add a preprocessing step to resize them to the size expected by the original model. More generally, transfer learning will work best when the inputs have similar low-level features.*

In [9]:
# Initially freezing all the reused layers
for layer in freezed_model.layers[:-1]:
    layer.trainable = False

*The more similar the tasks are, the more layers you will want to reuse (starting with the lower layers). For very similar tasks, try to keep all the hidden layers and just replace the output layer.*

In [None]:
# Compiling the model after freezing all the reused layers
freezed_model.compile(optimizer = keras.optimizers.Adam(learning_rate=0.001), loss = 'sparse_categorical_crossentropy', metrics = ['accuracy']) # You must always compile your model after you freeze or unfreeze layers.

# Training the model
history = freezed_model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test))

Epoch 1/10
[1m1875/1875[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 3ms/step - accuracy: 0.3805 - loss: 10.1155 - val_accuracy: 0.4998 - val_loss: 2.9459
Epoch 2/10
[1m1875/1875[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 3ms/step - accuracy: 0.5332 - loss: 1.9109 - val_accuracy: 0.5765 - val_loss: 1.3154
Epoch 3/10
[1m1875/1875[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 3ms/step - accuracy: 0.5681 - loss: 1.2284 - val_accuracy: 0.5603 - val_loss: 1.2057
Epoch 4/10
[1m1875/1875[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 3ms/step - accuracy: 0.5760 - loss: 1.1785 - val_accuracy: 0.5961 - val_loss: 1.1584
Epoch 5/10
[1m1875/1875[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 3ms/step - accuracy: 0.5819 - loss: 1.1689 - val_accuracy: 0.5914 - val_loss: 1.1728
Epoch 6/10
[1m1875/1875[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 3ms/step - accuracy: 0.5812 - loss: 1.1659 - val_accuracy: 0.5711 - val_loss: 1.1732
Epoch 7/10
[1m

In [None]:
# Unfreezing the 3rd and 4th layer of reused layers
freezed_model.layers[3].trainable = True
freezed_model.layers[4].trainable = True

# Compiling the model again after unfreezing the training
freezed_model.compile(
    optimizer = keras.optimizers.Adam(learning_rate=0.0001),
    loss = "sparse_categorical_crossentropy",
    metrics = ["accuracy"]
)

*The more training data you have, the more layers you can unfreeze.*

*It is also useful to reduce the learning rate when you unfreeze reused layers: this will avoid wrecking their fine-tuned weights.*

In [15]:
# Training the model again
history = freezed_model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test))

Epoch 1/10
[1m1875/1875[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m9s[0m 4ms/step - accuracy: 0.6753 - loss: 0.9012 - val_accuracy: 0.7095 - val_loss: 0.8222
Epoch 2/10
[1m1875/1875[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 3ms/step - accuracy: 0.7218 - loss: 0.7660 - val_accuracy: 0.7316 - val_loss: 0.7504
Epoch 3/10
[1m1875/1875[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 3ms/step - accuracy: 0.7450 - loss: 0.7110 - val_accuracy: 0.7449 - val_loss: 0.7126
Epoch 4/10
[1m1875/1875[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 3ms/step - accuracy: 0.7569 - loss: 0.6770 - val_accuracy: 0.7582 - val_loss: 0.6879
Epoch 5/10
[1m1875/1875[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 3ms/step - accuracy: 0.7655 - loss: 0.6547 - val_accuracy: 0.7607 - val_loss: 0.6742
Epoch 6/10
[1m1875/1875[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 3ms/step - accuracy: 0.7711 - loss: 0.6392 - val_accuracy: 0.7673 - val_loss: 0.6548
Epoch 7/10
[1m1

In [16]:
freezed_model.evaluate(x_test, y_test)

[1m313/313[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 2ms/step - accuracy: 0.7778 - loss: 0.6262


[0.6262159943580627, 0.7778000235557556]

*The performance is increased but this is not how we suppose to apply transfer learning. What I did was, I tried many configurations until I found one that demonstrated a strong improvement. If you try to change the classes or the random seed, you will see that the improvement generally drops, or even vanishes or reverses. What I did is called “torturing the data until it confesses”.*

*It turns out that `transfer learning does not work very well with small dense networks`, presumably because `small networks learn few patterns, and dense networks learn very specific patterns`, which are unlikely to be useful in other tasks. Transfer learning works best with deep `convolutional neural networks`, which tend to learn feature detectors that are much more general (especially in the lower layers).*

*If you have plenty of training data, you may try replacing the top hidden layers instead of dropping them, and even adding more hidden layers.*