# CSCI E-25      
## Transfer Learning and Data Augmentation  
### Steve Elston

## Introduction

**Transfer learning** and **data augmentation** are important approaches to practical use of deep neural network models for computer vision. In this notebook you will work with examples of each of these methods.    

> **Note:** Portions of the text and code in this notebook were taken from [Yixing Fu's](https://github.com/yixingfu) Keras example, updated 2023/07/10 and titled *Use EfficientNet with weights pre-trained on imagenet for Stanford Dogs classification*.

### Transfer learning   
Transfer learning employs deep neural network models which have been previously trained on large datasets. This pre-training can require enormous computing resources, and management of massive datasets. These resources are often not available in practice. The dataset available for a particular problem may be too small, or the computing resources may not be available for the project.  

Instead, we can use a model with weights trained previously on other datasets. We say that we **transfer** the learning from one task **learned** with some training data to another task with different image characteristics. This process is known as **transfer learning**.

Most deep learning platforms have a module of pre-trained models. You can find an [extensive list of pre-trained models for Keras](https://keras.io/api/applications/).

In most cases of transfer learning there are two major components of the model, a **backbone network** and a **head**. The pre-trained **backbone** network produces a feature map. A head, placed on the backbone, performs the task-specific learning. Examples of task-specific learning include:   
- **Object classification:** Our goal for the exercises in this lesson.  
- **Object detection:** Find the objects in an image.  
- **Semantic segmentation:** Detect and label the types of 'things' in an image.  
- **Object tracking:** Track how the objects in a series of images (a video) are moving.   

There are several approaches to task-specific training with transfer learning:     

**Frozen backbone network:** The weights of the backbone network are frozen and the resulting feature map is used directly. The weights of the task-specific head are learned using task-specific data. This approach minimizes the volume of task-specific data required, since only the weights of the head need be trained. This approach leads to methods known as **few shot training** for a specific case of a task. Accuracy may be sacrificed since the feature map is not optimized for the task.         

**Fine tune training of the backbone network:** Weights of the task specific head are learned using the task-specific data. At the same time, the weights of the backbone network are **fine-tuned** for the task. In many cases, only a few epochs are required for fine tuning. This approach often provides better performance, since the feature map produced by the backbone has the chance to learn task-specific features. However, fine tuning of the backbone network may fail if there is insufficient data to effectively train the large number of additional weights.     


### Data augmentation     

Even with good pre-training, sufficient task-specific data may not be available to learn the required head weights using transfer learning. In such cases one can apply **data augmentation**. The process of data augmentation creates new training samples from existing training images. The new image samples are created by **randomly** applying one or more **transformations** to the original image. The label of the transformed image is the same as that of the original image. Yet, given the randomly chosen transformations applied, the new image will have different characteristics. Further, since the transformations are random in nature, several new samples can be created from the same original image. Thus, augmented data helps train models that generalize better.

Deep learning platforms include packages for standard random data augmentation. For example, in the TensorFlow documentation you can find [examples of applying random transformations](https://www.tensorflow.org/tutorials/images/data_augmentation) to augment image data.

Examples of random transformations which can be used to augment image training data include:   
1. Random rotation.     
2. Random translation along either or both axes.  
3. Random cropping of the image followed by resizing to the original size.
4. Flipping the transformed image to create a mirror image.   
5. Randomly adjusting the contrast of the image.   
6. Adding Gaussian or other noise to the image.  
7. Independently applying random brightness adjustments to the histograms of the color channels.
8. Randomly down-sampling followed by resizing.  
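
As a minimal sketch of how random transformations yield several new samples from one image, the NumPy code below applies a random flip, a random translation and a random contrast change. The function name `random_augment`, the stand-in image and the label value are assumptions made for illustration. The translation wraps pixels around the edges and the contrast uses a single global mean, both of which are simplifications compared with library augmentation layers.

```python
import numpy as np


def random_augment(image, rng, max_shift=0.1, contrast=0.1):
    """Return a randomly flipped, shifted and contrast-adjusted copy of `image`."""
    out = image.astype(np.float32)
    height, width = out.shape[:2]

    # Random horizontal flip (mirror image) with probability 0.5.
    if rng.random() < 0.5:
        out = out[:, ::-1, :]

    # Random translation along both axes; pixels wrap around the edges.
    max_dy = int(max_shift * height)
    max_dx = int(max_shift * width)
    dy = int(rng.integers(-max_dy, max_dy + 1))
    dx = int(rng.integers(-max_dx, max_dx + 1))
    out = np.roll(out, shift=(dy, dx), axis=(0, 1))

    # Random contrast: scale the deviation from the mean intensity.
    factor = 1.0 + rng.uniform(-contrast, contrast)
    mean = out.mean()
    out = (out - mean) * factor + mean
    return np.clip(out, 0.0, 255.0)


rng = np.random.default_rng(seed=0)
image = rng.integers(0, 256, size=(32, 32, 3)).astype(np.uint8)  # stand-in image
label = 7  # hypothetical class index

# Several new samples from one original; every sample keeps the original label.
augmented = [(random_augment(image, rng), label) for _ in range(4)]
```

Each call draws new random parameters, so the four samples differ from each other while sharing the label of the original image.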

## What is EfficientNet

EfficientNet, first introduced in [Tan and Le, 2019](https://arxiv.org/abs/1905.11946)
is among the most efficient models (i.e. requiring least FLOPS for inference)
that reaches State-of-the-Art accuracy on both
Imagenet and common image classification transfer learning tasks.

The smallest base model is similar to [MnasNet](https://arxiv.org/abs/1807.11626), which
reached near-SOTA level performance with a significantly smaller model. By introducing a heuristic for
scaling the model, EfficientNet provides a family of models (B0 to B7) that represents a
good combination of efficiency and accuracy on a variety of scales. The compound-scaling
heuristic (for details see
[Tan and Le, 2019](https://arxiv.org/abs/1905.11946)) allows the
efficiency-oriented base model (B0) to surpass models at every scale, while avoiding
extensive grid-searches of hyperparameters.
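
The arithmetic behind compound scaling can be sketched in a few lines. The coefficients below are the values reported by Tan and Le (2019) for the B0 baseline. The function names are illustrative, and the sketch only computes the idealized multipliers, not the rounded values used in any released model.

```python
# Coefficients reported by Tan and Le (2019) for the B0 baseline.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution


def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


def flops_multiplier(phi):
    """FLOPS grow roughly as depth * width^2 * resolution^2."""
    depth, width, resolution = compound_scale(phi)
    return depth * width ** 2 * resolution ** 2


for phi in range(4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"resolution x{r:.2f}, FLOPS x{flops_multiplier(phi):.2f}")
```

Because the product of `ALPHA`, `BETA` squared and `GAMMA` squared is close to 2, each unit increase of `phi` roughly doubles the computational cost.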

A summary of the latest updates on the model is available
[here](https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet). Various
augmentation schemes and semi-supervised learning approaches are applied to
improve the Imagenet performance of the models. These extensions change only the weights, while the model architecture stays the same.

### B0 to B7 variants of EfficientNet

Based on the [original EfficientNet paper](https://arxiv.org/abs/1905.11946) people may have the
impression that the model comprises a continuous family created by arbitrarily
choosing scaling factors.  However, choice of resolution,
depth and width are also restricted by multiple factors:

- **Resolution:** Resolutions not divisible by 8, 16, etc. cause zero-padding near boundaries
of some layers, wasting computational resources. This constraint applies particularly to smaller
variants of the model, hence the input resolutions for B0 and B1 are chosen as 224 and
240.

- **Depth and width:** The building blocks of EfficientNet demand channel sizes that are
multiples of 8.

- **Resource limit:** Memory limitation can limit resolution when depth
and width are increased. Increasing depth and/or
width while maintaining resolution can still improve performance.

As a result, the depth, width and resolution of each variant of the EfficientNet models
are chosen to produce good results, even if these choices deviate from the compound scaling formula.
Consequently, the Keras implementation only provides 8 models, B0 to B7,
instead of allowing arbitrary choice of width, depth and resolution parameters.
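
The multiple-of-8 constraint on channel sizes can be illustrated with a small rounding helper. This is a sketch of one plausible rounding rule, modeled on helpers commonly seen in EfficientNet implementations. The name `round_filters`, the 10% guard and the example width coefficients are assumptions, not a statement of what any particular library does.

```python
def round_filters(filters, width_coefficient, divisor=8):
    """Scale a channel count and round it to a multiple of `divisor`."""
    scaled = filters * width_coefficient
    # Round to the nearest multiple of `divisor`, never below `divisor` itself.
    rounded = max(divisor, int(scaled + divisor / 2) // divisor * divisor)
    # Avoid rounding down by more than 10% of the scaled value.
    if rounded < 0.9 * scaled:
        rounded += divisor
    return int(rounded)


base_channels = [32, 16, 24, 40, 80]  # illustrative channel counts
for width in (1.0, 1.1, 1.4):
    print(width, [round_filters(c, width) for c in base_channels])
```

A scaled width of 35.2 channels, for example, cannot be used directly, so the helper snaps it to 32. This is one reason the released variants do not follow the scaling formula exactly.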

### Keras implementation of EfficientNet

An implementation of EfficientNet B0 to B7 has been shipped with Keras since v2.3. To
use EfficientNetB0 for classifying 1000 classes of images from ImageNet, run:

```python
from tensorflow.keras.applications import EfficientNetB0
model = EfficientNetB0(weights='imagenet')
```

The EfficientNetB0 model takes input images of shape `(224, 224, 3)`, and the input pixel values should be in the range `[0, 255]`. Normalization is included as part of the model.

Because training EfficientNet on ImageNet takes enormous resources, the Keras
implementation by default loads pre-trained weights obtained via training with
[AutoAugment](https://arxiv.org/abs/1805.09501).

For B0 to B7 base models, the input shapes increase as shown in the table:

| Base model | resolution|
|----------------|-----|
| EfficientNetB0 | 224 |
| EfficientNetB1 | 240 |
| EfficientNetB2 | 260 |
| EfficientNetB3 | 300 |
| EfficientNetB4 | 380 |
| EfficientNetB5 | 456 |
| EfficientNetB6 | 528 |
| EfficientNetB7 | 600 |

When the model is intended for transfer learning, the Keras implementation
provides an option to remove the top layers:
```python
model = EfficientNetB0(include_top=False, weights='imagenet')
```
This option excludes the final `Dense` layer that turns 1280 features from the penultimate
layer into predictions for the 1000 ImageNet classes. Replacing the top layer with a task-specific head allows using the effective feature map created by EfficientNet.
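
To give a sense of the size of the layer being replaced, the sketch below counts the parameters of a fully connected layer from its input and output sizes. The function name `dense_param_count` is illustrative. The figures 1280 and 1000 come from the text above.

```python
def dense_param_count(n_features, n_classes):
    """Weights plus biases of a fully connected (Dense) layer."""
    return n_features * n_classes + n_classes


# The ImageNet top layer maps 1280 features to 1000 classes.
imagenet_top = dense_param_count(1280, 1000)
print(imagenet_top)  # 1281000
```

The same function can be called with the number of classes of a task-specific head to see how few weights that head requires.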

Another important argument is `drop_connect_rate` which controls
the dropout rate responsible for [stochastic depth](https://arxiv.org/abs/1603.09382).
This parameter controls additional regularization during fine tuning, but does not
affect pre-trained weights. For example, when stronger regularization is desired, increase the `drop_connect_rate` from the default of 0.2:

```python
model = EfficientNetB0(weights='imagenet', drop_connect_rate=0.4)
```
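
The idea behind stochastic depth can be sketched with NumPy: during training, the residual branch of a block is dropped entirely for a random subset of samples, and the kept branches are rescaled. This is a simplified illustration with assumed names. It does not reproduce how the Keras implementation distributes the rate across blocks.

```python
import numpy as np


def residual_with_stochastic_depth(x, branch, drop_rate, rng, training=True):
    """Add `branch` to `x`, randomly dropping the whole branch per sample."""
    if not training or drop_rate == 0.0:
        return x + branch
    keep_prob = 1.0 - drop_rate
    # One keep/drop decision per sample, broadcast over height, width, channels.
    keep = rng.random(size=(x.shape[0], 1, 1, 1)) < keep_prob
    # Rescale kept branches so the expected output matches inference.
    return x + keep * branch / keep_prob


rng = np.random.default_rng(seed=0)
x = np.zeros((8, 4, 4, 3))      # skip-connection input for 8 samples
branch = np.ones((8, 4, 4, 3))  # stand-in for the output of the block's layers

train_out = residual_with_stochastic_depth(x, branch, drop_rate=0.4, rng=rng)
infer_out = residual_with_stochastic_depth(x, branch, 0.4, rng, training=False)
print(train_out[:, 0, 0, 0])  # per sample: either 0 or 1 / 0.6
```

A higher drop rate removes the branch more often, which acts as stronger regularization.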



## Classification of Stanford Dogs with EfficientNetB0

EfficientNet is capable of a wide range of image classification tasks. As a result, transfer learning with this model can be applied to a number of tasks. The EfficientNetB0 model is the smallest model in the EfficientNet family. This version of the model is faster to train and faster at inference, but with reduced accuracy.

In this notebook, we will use a pre-trained **EfficientNetB0** to classify the breeds of dogs in the
[Stanford Dogs](http://vision.stanford.edu/aditya86/ImageNetDogs/main.html) dataset. The remainder of the notebook contains code for training the model using three approaches:    
1. **Training from scratch:** In this case no pre-trained weights are used. The model weights are learned from scratch using the training data.     
2. **Training the head with frozen backbone weights:** Here, only the weights of the classification head of the model are trained, with the **backbone weights frozen**. In other words, the feature map is created from the pretrained weights. This feature map is used by the classifier head, which is trained using the training data.      
3. **Fine tuning weights:** In some cases, it is possible to improve model performance by **fine tuning** the weights of the backbone. The fine tuning can result in a feature map better suited to the features specific to the images used for the particular case. Once the model in the notebook is trained with the frozen backbone weights, selected layers are unfrozen by setting their `trainable` attribute, and training continues.  

### Setup to Run this Notebook

This notebook was created and tested using a Google Colab Pro+ account. While not considered large by current standards, training the models in this notebook is computationally intensive.  Expect long run-times for model training in any environment. You are free to run this notebook in any environment of your choosing that has sufficient resources. 

To run the notebook in Colab you will need a [Google Colaboratory account](https://colab.research.google.com/) if you do not already have one. Log into your Google account. You can then *Upload* this notebook into your Colab work space. Make sure you configure the Runtime to use an appropriate GPU, such as A100. Large memory should not be required. Further, a dedicated Google cloud storage account (not Google Drive) is required.   

Depending on the environment you are using, you may need to install and import `np_utils` explicitly. If you find this is the case, uncomment and execute the code in the cell below.   

In [None]:
#!pip install np_utils
#import np_utils as ku

Execute the code in the cell below to import the packages required to execute the remainder of this notebook.  

Notice that the **EfficientNetB0** model is imported.  

In [None]:
import numpy as np
import tensorflow_datasets as tfds
import tensorflow as tf  # For tf.data
import matplotlib.pyplot as plt
import keras
from keras import layers
from keras.applications import EfficientNetB0

# IMG_SIZE is determined by EfficientNet model choice
IMG_SIZE = 224
BATCH_SIZE = 64

print('TensorFlow version = ' + str(tf.__version__))
print('Keras version = ' + str(keras.__version__))

Ensure that your environment has a version of TensorFlow $\ge 2$ and Keras version $\ge 3$.

### Loading data

The Stanford Dogs dataset contains over 20,000 images of 120 dog breeds. The goal is to learn a model that classifies dog breeds correctly from the images.

Here we load data from [tensorflow_datasets](https://www.tensorflow.org/datasets)
(hereafter TFDS). By simply changing `dataset_name` below, you may also try this notebook for
other datasets in TFDS such as
[cifar10](https://www.tensorflow.org/datasets/catalog/cifar10),
[cifar100](https://www.tensorflow.org/datasets/catalog/cifar100),
[food101](https://www.tensorflow.org/datasets/catalog/food101),
etc. When the images are smaller than the required size for EfficientNet input,
they are up-sampled. It has been shown in
[Tan and Le, 2019](https://arxiv.org/abs/1905.11946) that transfer learning
works better with increased resolution even if input images remain small.
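
As a rough illustration of up-sampling, the sketch below enlarges a CIFAR-sized image to the EfficientNetB0 input size by repeating pixels. This nearest-neighbor approach is an assumption chosen for simplicity. Library resize functions typically interpolate instead, so the result will not match them exactly.

```python
import numpy as np


def upsample_nearest(image, factor):
    """Repeat every pixel `factor` times along height and width."""
    return np.repeat(np.repeat(image, factor, axis=0), factor, axis=1)


rng = np.random.default_rng(seed=0)
small = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)  # stand-in image

# 32 * 7 = 224, the input resolution of EfficientNetB0.
large = upsample_nearest(small, factor=7)
print(small.shape, "->", large.shape)
```

No new information is created by up-sampling. The image simply reaches the spatial size that the pre-trained layers expect.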

> **Note** You may see a warning about a missing file, `dataset_info.json`. You can safely ignore this warning as this file is not necessary.  

In [None]:
dataset_name = "stanford_dogs"
(ds_train, ds_test), ds_info = tfds.load(
    dataset_name, split=["train", "test"], with_info=True, as_supervised=True
)
NUM_CLASSES = ds_info.features["label"].num_classes


When the dataset includes images of various sizes, they are resized to the common required size. The Stanford Dogs dataset includes only images of size at least $200 \times 200$. These images are resized as required for EfficientNet.

In [None]:
size = (IMG_SIZE, IMG_SIZE)
ds_train = ds_train.map(lambda image, label: (tf.image.resize(image, size), label))
ds_test = ds_test.map(lambda image, label: (tf.image.resize(image, size), label))

### Resizing and visualizing the data

The preceding code cells loaded the Stanford Dogs dataset and resized the images to $224 \times 224$ pixels, as required by EfficientNetB0. The Stanford Dogs dataset comes with independently sampled train and test sets.     

Execute the code in the cell below to display the first 9 images with their labels.

In [None]:
def format_label(label):
    string_label = label_info.int2str(label)
    return string_label.split("-")[1]


label_info = ds_info.features["label"]
print('Number of classes: ' + str(NUM_CLASSES) + '\n\n')
for i, (image, label) in enumerate(ds_train.take(9)):
    ax = plt.subplot(3, 3, i + 1)
    plt.imshow(image.numpy().astype("uint8"))
    plt.title("{}".format(format_label(label)))
    plt.axis("off")


Notice the following about these images:   
1. There are different crops and angles for the dog in each image.    
2. The dogs in the images appear at a variety of scales.  
3. Some images contain objects that are not dogs.


### Data augmentation

We can use the Keras preprocessing layers API for image augmentation.

Very often only limited task-specific training data is available. In such cases, **augmenting** the training data can be an effective pre-processing step.  The code below performs a series of randomly selected data augmentation steps. The code in the second cell of this section displays a sample of the augmented images.     

The code in the cell below creates a list of Keras augmentation layers and a function that applies them in sequence. Such layers can be used both as a part of
a model and as a function to pre-process data before training a model.  Execute this code.   

In [None]:
img_augmentation_layers = [
    layers.RandomRotation(factor=0.15),
    layers.RandomTranslation(height_factor=0.1, width_factor=0.1),
    layers.RandomFlip(),
    layers.RandomContrast(factor=0.1),
]


def img_augmentation(images):
    for layer in img_augmentation_layers:
        images = layer(images)
    return images


Using a function allows us to visualize some of the images. Execute the code to display 9 examples
of augmentation of a single image.

In [None]:
for image, label in ds_train.take(1):
    for i in range(9):
        ax = plt.subplot(3, 3, i + 1)
        aug_img = img_augmentation(np.expand_dims(image.numpy(), axis=0))
        aug_img = np.array(aug_img)
        plt.imshow(aug_img[0].astype("uint8"))
        plt.title("{}".format(format_label(label)))
        plt.axis("off")


> **Exercise 4-11:** Examine the foregoing result and answer these questions:   
> 1. Which augmentation methods are applied to the images?    
> 2. Describe some of the ways you can observe the multiple-transformation augmentation manifest in the displayed samples.   

> **Answers:**    
> 1.       
> 2.     

### Prepare inputs

Once we verify the input data and augmentation are working correctly,
we prepare the dataset for training. The input data have already been resized to a uniform
`IMG_SIZE`. The labels are put into one-hot
(a.k.a. categorical) encoding. The dataset is batched.

Note: `prefetch` and `AUTOTUNE` may improve performance in some situations, depending on the specific environment and dataset used. See this [TensorFlow guide to data performance](https://www.tensorflow.org/guide/data_performance) for more information.
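
For readers unfamiliar with the two operations, the following NumPy sketch shows what one-hot encoding and batching with a dropped remainder do to a small array of labels. The helper names `one_hot` and `batches` are illustrative and are not part of the notebook's pipeline.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Encode integer labels as rows with a single 1 in the label's column."""
    encoded = np.zeros((len(labels), num_classes), dtype=np.float32)
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded


def batches(array, batch_size, drop_remainder=True):
    """Split `array` into consecutive batches along the first axis."""
    n_full = len(array) // batch_size
    out = [array[i * batch_size:(i + 1) * batch_size] for i in range(n_full)]
    if not drop_remainder and len(array) % batch_size:
        out.append(array[n_full * batch_size:])
    return out


labels = np.array([2, 0, 1, 2, 1])
encoded = one_hot(labels, num_classes=3)
print(encoded[0])                                          # [0. 0. 1.]
print([b.shape for b in batches(encoded, batch_size=2)])   # [(2, 3), (2, 3)]
```

With `drop_remainder=True` the fifth sample is discarded, so every batch has the same shape.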

In [None]:

# One-hot / categorical encoding
def input_preprocess_train(image, label):
    image = img_augmentation(image)
    label = tf.one_hot(label, NUM_CLASSES)
    return image, label


def input_preprocess_test(image, label):
    label = tf.one_hot(label, NUM_CLASSES)
    return image, label


ds_train = ds_train.map(input_preprocess_train, num_parallel_calls=tf.data.AUTOTUNE)
ds_train = ds_train.batch(batch_size=BATCH_SIZE, drop_remainder=True)
ds_train = ds_train.prefetch(tf.data.AUTOTUNE)

ds_test = ds_test.map(input_preprocess_test, num_parallel_calls=tf.data.AUTOTUNE)
ds_test = ds_test.batch(batch_size=BATCH_SIZE, drop_remainder=True)


## Steps in Training the Model

The remainder of the notebook contains code for training the model by three different approaches. Examine the code and notice the three different training approaches:    
1. **Training from scratch:** In this case no pre-trained weights are used. The model weights are learned from scratch using the training data.     
2. **Training the head with frozen backbone weights:** In this case, the weights of the classification head of the model are trained, with the backbone weights frozen. In other words, the feature map is created from the pretrained weights. The feature map is used by the classifier head. The classifier uses weights learned from the training data.
3. **Fine tuning weights:** In some cases, it is possible to improve model performance by fine tuning the weights of the backbone. The fine tuning can result in a more task-specific feature map. With the head of the model trained with the frozen backbone weights, selected layers are unfrozen and then incrementally trained.    

> **Warning!! Expect slow execution!** It appears that the model is intended to be run using **Tensor Processing Units (TPUs)** rather than GPUs. However, configuring the environment to work correctly with TPUs is quite challenging. The notebook will execute with the above modifications, but slowly. Over 2 minutes per epoch are required for the model trained from scratch. Models using transfer learning require about a minute or less per epoch.    


## Training the model from scratch

Note: the accuracy will increase very slowly and may overfit.

In [None]:
model = EfficientNetB0(
    include_top=True,
    weights=None,
    classes=NUM_CLASSES,
    input_shape=(IMG_SIZE, IMG_SIZE, 3),
)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy",  "top_k_categorical_accuracy"])

model.summary()

In [None]:
epochs = 30 
hist = model.fit(ds_train, epochs=epochs, validation_data=ds_test)

Training the model is relatively fast. This might make it sound easy to simply train EfficientNet from scratch on any
dataset. However, training EfficientNet on smaller datasets,
especially those with lower resolution like CIFAR-100, faces the significant challenge of
overfitting.

Hence training from scratch requires very careful choice of hyperparameters, and suitable
regularization is difficult to find. It would also be much more demanding in resources.
Plotting the training and validation accuracy
makes it clear that validation accuracy stagnates at a low value.

In [None]:
import matplotlib.pyplot as plt

def plot_hist(hist):
    _,ax = plt.subplots(1,2, figsize = (12,6))
    ax[0].plot(hist.history["accuracy"], label="train")
    ax[0].plot(hist.history["val_accuracy"], label="validation")
    ax[0].plot(hist.history['top_k_categorical_accuracy'], label="Top 5 train accuracy")
    ax[0].plot(hist.history['val_top_k_categorical_accuracy'], label="Top 5 validation accuracy")
    ax[0].set_title("model accuracy")
    ax[0].set_ylabel("accuracy")
    ax[0].set_xlabel("epoch")
    ax[0].legend(loc="upper left")
    ax[1].plot(hist.history["loss"], label="train")
    ax[1].plot(hist.history["val_loss"], label="validation")
    ax[1].set_title("model loss")
    ax[1].set_ylabel("loss")
    ax[1].set_xlabel("epoch")
    ax[1].legend(loc="upper right")
    plt.show()

plot_hist(hist)

> **Exercise 4-12:** Examine the results of training from scratch and answer these questions:   
> 1. Based on the validation loss and accuracy, is the model continuing to learn toward the end of the epochs?     
> 2. Based on the training and test losses and error rates, does the model show generalization as the training epochs progress?

> **Answers:**   
> 1.     
> 2.      

## Transfer learning from pre-trained weights

In transfer learning the convolutional layers that generate the feature map use pre-trained weights that are frozen during training. These pre-trained layers constitute a **convolutional backbone**. A new task-specific head is added on top of the backbone and trained. The argument `include_top=False` creates a model object without the head. We then initialize the model with pre-trained ImageNet weights, and train the head weights using the task-specific dataset.

Notice that the model object is instantiated using the `include_top=False` argument. The `weights='imagenet'` argument applies weights learned by supervised learning with the large [ImageNet dataset](https://www.image-net.org/) to the backbone. These weights are then frozen by setting the `model.trainable` attribute to `False`.      

Layers of the task-specific head are specified and added to the model. The first of these layers has the input `model.output`, which is the output of the backbone.   

There are two hyperparameters we should set before attempting to train the model. The hyperparameter space was explored through a series of experiments. The experiments and results shown in the table below were run only on the first 10 epochs to limit experimentation time.

| Accuracy  | Learning Rate | Weight Decay |    
| --------  | ------------- | ------------ |     
| .7898     | $10^{-2}$       | None       |      
| .7853     | $10^{-2}$      | 0.001       |    
| .7931     | $10^{-2}$      | 0.01        |
| .8180     | $10^{-3}$      | 0.01        |
| .8108     | $10^{-4}$      | 0.01        |

The experimentation shows that performance of the model can be improved by relatively low learning rate and use of weight decay for regularization. It may be the case that we could do even better with a variable learning rate, but we will not do so in the name of simplification. For more complex models a variable learning rate may be essential for good training.    
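
As an illustration of what a variable learning rate could look like, the sketch below implements a cosine decay schedule in plain Python. This schedule is not used in this notebook. The function name and the initial and final rates are hypothetical values chosen for the example, not results from the experiments above.

```python
import math


def cosine_decay(step, total_steps, initial_lr=1e-3, final_lr=1e-5):
    """Learning rate that falls smoothly from initial_lr to final_lr."""
    progress = min(step, total_steps) / total_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (initial_lr - final_lr) * cosine


total_steps = 30  # e.g. one value per epoch
schedule = [cosine_decay(step, total_steps) for step in range(total_steps + 1)]
print(f"start={schedule[0]:.1e}, middle={schedule[15]:.1e}, end={schedule[-1]:.1e}")
```

The rate starts high to make quick progress and ends low to allow small, stable updates near convergence.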

Execute the code in the cell below.

In [None]:

def build_model(num_classes):
    inputs = layers.Input(shape=(IMG_SIZE, IMG_SIZE, 3))
    model = EfficientNetB0(include_top=False, input_tensor=inputs, weights="imagenet")

    # Freeze the pretrained weights
    model.trainable = False

    # Rebuild top
    x = layers.GlobalAveragePooling2D(name="avg_pool")(model.output)
    x = layers.BatchNormalization()(x)

    top_dropout_rate = 0.2
    x = layers.Dropout(top_dropout_rate, name="top_dropout")(x)
    outputs = layers.Dense(num_classes, activation="softmax", name="pred")(x)

    # Compile
    model = keras.Model(inputs, outputs, name="EfficientNet")
    optimizer = keras.optimizers.Adam(learning_rate=1e-4, weight_decay=0.01)
    model.compile(
        optimizer=optimizer, loss="categorical_crossentropy",
        metrics=["accuracy",  "top_k_categorical_accuracy"]
    )
    return model


We are now ready to train the task-specific head. Note that the convergence may take more than 30 epochs depending on choice of learning rate.

In part, the training is slowed by the use of data augmentation. However, if image augmentation layers were not applied, the validation accuracy may only reach ~60%.

Execute the code in the cell below to train the model.

In [None]:
model = build_model(num_classes=NUM_CLASSES)
model.summary()  # summary() prints the table itself and returns None

epochs = 30
hist = model.fit(ds_train, epochs=epochs, validation_data=ds_test)
plot_hist(hist)

> **Exercise 4-13:** Next, examine the results of training the task-specific head using the pre-trained backbone (frozen weights) and answer the following questions.     
> 1. A task-specific classifier head has been added to the model in the `build_model` function. Briefly describe the layers of this head.
> 2. Compare the number of parameters for the model with frozen backbone weights just trained to the previously trained model with all free parameters. How significant is this difference in terms of information required for training the model?
> 3. In terms of training and test loss and accuracy, describe the progress of the training of the head. Does the training appear to be nearly complete?    
> 4. What evidence is there that the model will generalize?   

> **Answers:**      
> 1.    
> 2.     
> 3.        
> 4.          

### Evaluate the trained model

We will now apply these evaluation functions to evaluate the trained model:     
- Accuracy.
- Top 5 accuracy.           
- Weighted mean precision and recall measures.
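
To make these measures concrete, the NumPy sketch below computes them by hand on a toy problem with three classes. The helper names and the toy probabilities are made up for illustration. The sketch is meant to show the definitions rather than replace a tested metrics library.

```python
import numpy as np


def top_k_accuracy(true_labels, probabilities, k):
    """Fraction of samples whose true label is among the k most probable classes."""
    top_k = np.argsort(probabilities, axis=1)[:, -k:]
    hits = [label in row for label, row in zip(true_labels, top_k)]
    return float(np.mean(hits))


def weighted_precision_recall(true_labels, predicted, num_classes):
    """Per-class precision and recall, averaged with class-frequency weights."""
    precision = np.zeros(num_classes)
    recall = np.zeros(num_classes)
    support = np.zeros(num_classes)
    for c in range(num_classes):
        true_pos = np.sum((predicted == c) & (true_labels == c))
        n_predicted = np.sum(predicted == c)
        support[c] = np.sum(true_labels == c)
        precision[c] = true_pos / n_predicted if n_predicted else 0.0
        recall[c] = true_pos / support[c] if support[c] else 0.0
    weights = support / support.sum()
    return float(precision @ weights), float(recall @ weights)


true_labels = np.array([0, 1, 2, 2])
probabilities = np.array([
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])
predicted = probabilities.argmax(axis=1)

print(top_k_accuracy(true_labels, probabilities, k=1))   # 0.5
print(top_k_accuracy(true_labels, probabilities, k=2))   # 1.0
print(weighted_precision_recall(true_labels, predicted, num_classes=3))  # (0.625, 0.5)
```

Two of the four samples are misclassified, yet in both cases the true class is the second most probable, so the top-2 accuracy is 1.0.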

Execute the code in the cell below.

In [None]:
import sklearn.metrics as metrics

def print_model_performance(test_labels, test_model):
    ## Compute predicted labels; ds_test is already batched, so no batch_size is passed
    predictions = test_model.predict(ds_test)
    predicted = predictions.argmax(axis=1)

    k = 5
    print('Overall accuracy = ' + str(round(metrics.accuracy_score(test_labels, predicted), 4)))
    print('Top 5 accuracy = ' + str(round(metrics.top_k_accuracy_score(test_labels, predictions, k=k),4)))

    unique_labels, label_counts = np.unique(test_labels, return_counts=True)
    class_precision = metrics.precision_score(test_labels, predicted, labels=unique_labels, average=None)
    class_recall = metrics.recall_score(test_labels, predicted, labels=unique_labels, average=None)

    sum_label_counts = np.sum(label_counts)
    weighted_average = lambda x: round(np.sum(np.divide(x * label_counts, sum_label_counts)), 4)
    print('Average precision = ' + str(weighted_average(class_precision)))
    print('Average recall = ' + str(weighted_average(class_recall)))
    return predicted

## Recover the integer class labels from the one-hot encoded test labels
one_hot_labels = [x for _,x in  ds_test.unbatch().as_numpy_iterator()]
test_labels = np.argmax(one_hot_labels, axis=1)

predicted = print_model_performance(test_labels, model)

To continue the evaluation we will now examine the confusion matrix. Given the number of classes, it is not only impractical, but useless to print the numerical values of the matrix. As an alternative, execute the code in the cell below to display a heat map of the confusion matrix.   

In [None]:
def plot_confusion_matrix(test_labels, predicted):
    confusion_matrix = metrics.confusion_matrix(test_labels, predicted)

    plt.figure(figsize = (12,9))
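    # Divide each row by its own total (keepdims keeps the row axis for broadcasting)
    row_totals = np.sum(confusion_matrix, axis=1, keepdims=True)
    p = plt.imshow(np.log(np.divide(confusion_matrix + 1.0, row_totals)))
    cb = plt.colorbar(p)
    _=cb.set_label('Log normalized count')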

plot_confusion_matrix(test_labels, predicted)
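
The quantity shown in such a heat map can be sketched on a two-class toy example. The function below is an illustrative NumPy version with assumed names. It divides each row by the number of samples of that true class, and it assumes every class occurs at least once in the true labels.

```python
import numpy as np


def log_normalized_confusion(true_labels, predicted, num_classes):
    """Log of the row-normalized confusion matrix, with 1 added to each count."""
    counts = np.zeros((num_classes, num_classes))
    for t, p in zip(true_labels, predicted):
        counts[t, p] += 1  # rows: true class, columns: predicted class
    # keepdims=True keeps shape (num_classes, 1), so each row is divided
    # by its own total rather than by the total of another row.
    row_totals = counts.sum(axis=1, keepdims=True)
    # Adding one before dividing avoids log(0) for empty cells.
    return np.log((counts + 1.0) / row_totals)


toy_true = np.array([0, 0, 1, 1, 1])
toy_pred = np.array([0, 1, 1, 1, 0])
toy_matrix = log_normalized_confusion(toy_true, toy_pred, num_classes=2)
print(np.round(toy_matrix, 3))
```

The logarithm compresses the large values on the diagonal, which makes the much smaller off-diagonal error counts visible in the plot.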

> **Exercise 4-14:**
> 1. Examine the accuracy, average precision and average recall computed from the model. What do these figures tell you about the consistency of the classification performance of the classifier model across the different classes (e.g. performance for the individual classes)?    
> 2. Examine the heat map of the log confusion matrix. What does the small number of off-diagonal weakly bright spots on this plot tell you about the errors for this classifier model?  

> **Answers:**
> 1.        
> 2.     

## Fine Tuning the Model

Keras models maintain weight values from one training run to the next. As a final step in model training, we will attempt fine-tuning of the model weights. Setting the `trainable` attribute of a layer to `True` unfreezes that layer, so updates to its weights can be learned.  The point of fine tuning is to increase model performance by performing an incremental task-specific training of the model weights. Using task-specific data for this training step allows the backbone to learn some task-specific features, different from the ones learned from ImageNet data alone. Since there are a large number of weights and only limited training data, only a small improvement is expected from this refinement.      

In this example we will unfreeze the top 20 layers of the model, leaving the `BatchNormalization` layers frozen. Keep in mind that in some cases it may be desirable to unfreeze a different fraction of the layers, especially if the model is very deep and has a large number of parameters.

To perform the fine tuning, execute the code below. Notice that the learning rate has been reduced to limit the chance of overfitting.  

In [None]:
def unfreeze_model(model):
    # We unfreeze the top 20 layers while leaving BatchNorm layers frozen
    for layer in model.layers[-20:]:
        if not isinstance(layer, layers.BatchNormalization):
            layer.trainable = True

    optimizer = keras.optimizers.Adam(learning_rate=1e-5, weight_decay=0.01)
    model.compile(
        optimizer=optimizer, loss="categorical_crossentropy",
        metrics=["accuracy",  "top_k_categorical_accuracy"]
    )


unfreeze_model(model)

epochs = 10  # @param {type: "slider", min:4, max:10}
hist = model.fit(ds_train, epochs=epochs, validation_data=ds_test)
plot_hist(hist)

To evaluate the results of the fine tuning execute the code in the cell below.

In [None]:
predicted = print_model_performance(test_labels, model)
plot_confusion_matrix(test_labels, predicted)

> **Exercise 4-15:** Examine the results of the fine tuning with the unfrozen backbone weights and answer these questions:     
> 1. Compare the validation accuracy of the last epochs of the model with frozen backbone weights and of the fine tuning with unfrozen weights. Has the fine tuning improved the overall validation model performance?    
> 2. Is there any evidence of learning during the fine-tune in the training or validation loss and why?
> 3. Is there any evidence of over fitting during the fine tuning and why?

> **Answers:**     
> 1.     
> 2.     
> 3.    

#### Copyright 2023, 2024, 2025, 2026 Stephen F Elston. All rights reserved.  