## Understanding Pre-training vs. Fine-tuning

When building machine learning models, you will often come across the terms pre-training and fine-tuning. What do they mean, and when would you use each?

**Pre-training:**
- Done when you want to give a model a general knowledge base.
- Typically done with larger datasets.
- Performed upfront when developing the initial model.
- Typically the whole network is trained throughout this process.

**Fine-tuning:**
- Done later, when you want to adapt a model to a specific use case that it does not yet generalize well to.
- Uses a smaller, typically labeled, dataset to adjust the model.
- It is common to train only the upper layers (those closest to the output) and keep the lower layers frozen.
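
The difference between the two phases comes down to which parameters receive updates. As a minimal sketch (a toy with scalar "weights" and a hypothetical `sgd_step` helper, not the Keras training loop), pre-training updates every parameter, while fine-tuning leaves the frozen ones untouched:

```python
def sgd_step(params, grads, frozen=frozenset(), lr=0.1):
    """Return updated parameters; names listed in `frozen` are not changed."""
    return {
        name: value if name in frozen else value - lr * grads[name]
        for name, value in params.items()
    }


# Toy scalar "weights" standing in for whole layers (illustrative values only).
params = {"lower_conv": 1.0, "upper_conv": -0.5, "head": 0.25}
grads = {"lower_conv": 0.4, "upper_conv": 0.2, "head": -1.0}

# Pre-training: every parameter is trainable.
after_pretrain_step = sgd_step(params, grads)

# Fine-tuning: keep the lower layer fixed and update only the upper layers.
after_finetune_step = sgd_step(params, grads, frozen={"lower_conv"})

print(after_pretrain_step)
print(after_finetune_step)
```

In Keras the same effect is obtained by setting `layer.trainable = False` on the layers to keep fixed, as done later in this notebook.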

We'll walk through an example of doing both of these steps using the CIFAR-10 and CIFAR-100 datasets.

## Load in Libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras import layers, Model, Input
from tensorflow.keras.datasets import cifar10, cifar100
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

## Load and Preprocess the CIFAR-10 Dataset (Pre-training)

In [None]:
# Load in CIFAR-10
(x_train_pre, y_train_pre), (x_test_pre, y_test_pre) = cifar10.load_data()

# Normalize pixel values to be between 0 and 1
x_train_pre = x_train_pre.astype('float32') / 255.0
x_test_pre = x_test_pre.astype('float32') / 255.0

# Print dataset shapes
print(f'CIFAR-10 Training data shape: {x_train_pre.shape}')
print(f'CIFAR-10 Training labels shape: {y_train_pre.shape}')

CIFAR-10 Training data shape: (50000, 32, 32, 3)
CIFAR-10 Training labels shape: (50000, 1)
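
As a quick check of what this preprocessing does, the sketch below uses a small synthetic batch (random data standing in for CIFAR-10, so nothing is downloaded). It shows that the scaled pixels end up as `float32` values in [0, 1] and that the labels are integer class ids in a column of shape `(N, 1)`, which is the form the sparse categorical cross-entropy loss used later works with. The one-hot version is included only for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a CIFAR batch: 4 RGB images of 32x32 uint8 pixels.
images = rng.integers(0, 256, size=(4, 32, 32, 3), dtype=np.uint8)
labels = np.array([[3], [0], [9], [3]])  # integer class ids, shape (N, 1)

# Same scaling as above: cast to float32, then divide by 255.
images_scaled = images.astype("float32") / 255.0
print(images_scaled.dtype, images_scaled.min() >= 0.0, images_scaled.max() <= 1.0)

# One-hot equivalent of the integer labels (10 classes), for comparison only.
one_hot = np.eye(10, dtype="float32")[labels.squeeze(axis=1)]
print(labels.shape, one_hot.shape)
```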


## Build the CNN Model Using the Functional API

I typically prefer the Functional API when using Keras because it gives you much more flexibility. It is not strictly needed when a model is a plain stack of layers, but the skip connections in the residual blocks below do not fit that pattern, and I find the Functional API easier to work with in any case.

In [None]:
# Creating residual block for CNN
# Adding regularization to prevent overfitting
def residual_block(x, filters):
    # Skip connection
    shortcut = x

    # Apply the first convolutional layer
    # 'same' padding keeps the spatial dimensions (height, width) unchanged
    x = layers.Conv2D(filters, (3, 3), padding='same',
                      kernel_regularizer=tf.keras.regularizers.l2(1e-3))(x)
    # Normalize features in each batch
    x = layers.BatchNormalization()(x)
    # Apply ReLU
    x = layers.ReLU()(x)

    # Apply the second convolutional layer
    x = layers.Conv2D(filters, (3, 3), padding='same',
                      kernel_regularizer=tf.keras.regularizers.l2(1e-3))(x)
    x = layers.BatchNormalization()(x)

    # Project the shortcut to match the output shape if necessary
    if shortcut.shape[-1] != x.shape[-1]:  # Check if channel dimensions match
        # 1x1 conv for channel adjustment
        shortcut = layers.Conv2D(filters, (1, 1), padding='same',
                                  kernel_regularizer=tf.keras.regularizers.l2(1e-3),
                                  use_bias=False)(shortcut)
        # Add Batch Normalization to shortcut
        shortcut = layers.BatchNormalization()(shortcut)

    # Add the shortcut to the main path
    x = layers.add([shortcut, x])

    # Apply activation
    x = layers.ReLU()(x)
    return x
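
The projection branch in `residual_block` is needed because element-wise addition requires both tensors to have the same shape. A 1x1 convolution is a per-pixel linear map across channels, so it can change the channel count without touching the spatial size. Here is a NumPy sketch of that idea, using random weights and leaving out the bias-free detail, batch normalization, and regularization for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Block input with 128 channels and main-path output with 256 channels.
shortcut = rng.normal(size=(2, 16, 16, 128)).astype("float32")
main_path = rng.normal(size=(2, 16, 16, 256)).astype("float32")

# A 1x1 convolution is a matrix multiply applied at every pixel:
# (batch, height, width, c_in) x (c_in, c_out) -> (batch, height, width, c_out)
kernel_1x1 = rng.normal(size=(128, 256)).astype("float32")
projected = np.einsum("bhwc,cd->bhwd", shortcut, kernel_1x1)

# With matching shapes the skip connection can be added, then passed through ReLU.
block_output = np.maximum(projected + main_path, 0.0)
print(projected.shape, block_output.shape)
```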

In [None]:
# Model input
inputs = Input(shape=(32, 32, 3))

# Initial Conv layer with increased filters for better feature extraction
x = layers.Conv2D(128, (3, 3), padding='same',
                  kernel_regularizer=tf.keras.regularizers.l2(1e-3))(inputs)
x = layers.BatchNormalization()(x)
x = layers.ReLU()(x)

# Residual blocks
x = residual_block(x, 256)
x = layers.MaxPooling2D((2, 2))(x)

x = residual_block(x, 128)
x = layers.MaxPooling2D((2, 2))(x)

x = residual_block(x, 128)
x = layers.MaxPooling2D((2, 2))(x)

# Global Average Pooling to reduce dimensionality
# Gets it ready to feed into feed-forward portion of network
x = layers.GlobalAveragePooling2D()(x)

# Fully connected layers with batch normalization and dropout
x = layers.Dense(256, kernel_regularizer=tf.keras.regularizers.l2(1e-3))(x)
x = layers.BatchNormalization()(x)
x = layers.ReLU()(x)
x = layers.Dropout(0.5)(x)

x = layers.Dense(128, kernel_regularizer=tf.keras.regularizers.l2(1e-3))(x)
x = layers.BatchNormalization()(x)
x = layers.ReLU()(x)
x = layers.Dropout(0.5)(x)

# Output layer for CIFAR-10 (10 classes)
outputs = layers.Dense(10, activation='softmax')(x)

# Create the model
model = Model(inputs=inputs, outputs=outputs)

# Summary
model.summary()
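
`model.summary()` prints the output shape and parameter count of every layer, and a few of those numbers can be checked by hand. The helper names below are made up for this sketch. The formulas are the usual ones for a convolution with bias (`k * k * c_in * c_out + c_out`), a dense layer with bias, and a 2x2 max pool with default stride, which halves height and width.

```python
def conv2d_params(kernel_size, in_channels, out_channels, use_bias=True):
    """Number of weights in a square-kernel Conv2D layer."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if use_bias else 0)


def dense_params(in_features, out_features):
    """Number of weights in a Dense layer with bias."""
    return in_features * out_features + out_features


# First Conv2D: 3x3 kernel, 3 input channels (RGB), 128 filters.
first_conv = conv2d_params(3, 3, 128)
print(first_conv)  # 3584

# Spatial size after each of the three 2x2 max-pooling layers.
size = 32
sizes = []
for _ in range(3):
    size //= 2
    sizes.append(size)
print(sizes)  # [16, 8, 4]

# GlobalAveragePooling2D leaves one value per channel (128 after the last
# residual block), so the first Dense(256) layer has 128 * 256 + 256 weights.
first_dense = dense_params(128, 256)
print(first_dense)  # 33024
```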

In [None]:
# Compile the model
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Learning rate scheduler
lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3)

# Add early stopping callback for training
early_stopping = EarlyStopping(monitor='val_loss',
                               patience=30, restore_best_weights=True)

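# Train the model (pre-train)
# Note: the test set doubles as the validation set here, so early stopping and
# the learning rate schedule are tuned on the same data we evaluate on below
history_pretrain = model.fit(
    x_train_pre, y_train_pre,
    epochs=150,
    batch_size=64,
    validation_data=(x_test_pre, y_test_pre),
    callbacks=[early_stopping, lr_scheduler]
)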

Epoch 1/150
782/782 ━━━━━━━━━━━━━━━━━━━━ 86s 90ms/step - accuracy: 0.3170 - loss: 3.3364 - val_accuracy: 0.3801 - val_loss: 2.3813 - learning_rate: 0.0010
Epoch 2/150
782/782 ━━━━━━━━━━━━━━━━━━━━ 62s 76ms/step - accuracy: 0.5914 - loss: 1.7005 - val_accuracy: 0.2713 - val_loss: 2.8283 - learning_rate: 0.0010
Epoch 3/150
782/782 ━━━━━━━━━━━━━━━━━━━━ 82s 76ms/step - accuracy: 0.6578 - loss: 1.4073 - val_accuracy: 0.4363 - val_loss: 2.0823 - learning_rate: 0.0010
Epoch 4/150
782/782 ━━━━━━━━━━━━━━━━━━━━ 59s 76ms/step - accuracy: 0.7000 - loss: 1.2802 - val_accuracy: 0.6512 - val_loss: 1.4097 - learning_rate: 0.0010
Epoch 5/150
782/782 ━━━━━━━━━━━━━━━━━━━━ 83s 76ms/step - accuracy: 0.7288 - loss: 1.2065 - val_accuracy: 0.5693 - val_loss: 1.6567 - learning_rate: 0.0010
Epoch 6/150
782/782 ━━━━━━━━━━━━━━━━━━━━ ...

In [None]:
# Evaluate initial model
test_loss, test_acc = model.evaluate(x_test_pre, y_test_pre,
                                     verbose=0)

print(f'Test accuracy: {round(test_acc,2)*100}%')
print(f'Test loss: {round(test_loss,2)}')

Test accuracy: 85.0%
Test loss: 0.79
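
The two callbacks passed to `model.fit` during pre-training both watch `val_loss`: one halves the learning rate after a few epochs without improvement, the other stops training after a longer stall. The sketch below is a simplified simulation of that logic on made-up loss values. It is not the Keras implementation (it ignores options such as `min_delta` and `cooldown`), the function name is hypothetical, and the early-stopping patience is shortened so that the example stops quickly.

```python
def simulate_callbacks(val_losses, lr, factor, lr_patience, stop_patience):
    """Return (learning rate after each epoch, epoch at which training stops or None)."""
    best = float("inf")
    lr_wait = 0    # epochs without improvement since the last LR change
    stop_wait = 0  # epochs without improvement since the best epoch
    lr_history = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            lr_wait = 0
            stop_wait = 0
        else:
            lr_wait += 1
            stop_wait += 1
            if lr_wait >= lr_patience:
                lr *= factor  # plateau detected: shrink the learning rate
                lr_wait = 0
        lr_history.append(lr)
        if stop_wait >= stop_patience:
            return lr_history, epoch  # no improvement for too long: stop
    return lr_history, None


# Made-up validation losses, for illustration only.
val_losses = [2.0, 1.5, 1.6, 1.7, 1.55, 1.4, 1.45, 1.5, 1.6, 1.7, 1.8]
lr_history, stopped_at = simulate_callbacks(
    val_losses, lr=1e-3, factor=0.5, lr_patience=3, stop_patience=5
)
print(lr_history)
print(stopped_at)
```

With `restore_best_weights=True`, Keras additionally rolls the model back to the weights from the best epoch once training stops.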


## Load and Preprocess the CIFAR-100 Dataset (Fine-tuning)

Next we bring in our second dataset, CIFAR-100, on which we will fine-tune the model. Pay attention to the metrics for each epoch to see how the model is learning.
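
When reading the per-epoch metrics, it helps to know what a model that guesses uniformly at random would score. The sketch below computes that chance-level accuracy and cross-entropy for both datasets. One caveat, stated as an assumption about this setup: the `loss` value Keras reports should also include the L2 penalties from `kernel_regularizer`, so early loss values can sit above these baselines.

```python
import math

baselines = {}
for name, num_classes in [("CIFAR-10", 10), ("CIFAR-100", 100)]:
    chance_accuracy = 1.0 / num_classes
    # Cross-entropy of a uniform prediction: -log(1 / num_classes) = log(num_classes)
    chance_loss = math.log(num_classes)
    baselines[name] = (chance_accuracy, chance_loss)
    print(f"{name}: chance accuracy {chance_accuracy:.2%}, chance loss {chance_loss:.3f}")
```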

In [None]:
# Load in CIFAR-100
(x_train_fine, y_train_fine), (x_test_fine, y_test_fine) = cifar100.load_data()

# Normalize pixel values
x_train_fine = x_train_fine.astype('float32') / 255.0
x_test_fine = x_test_fine.astype('float32') / 255.0

# Print dataset shapes
print(f'CIFAR-100 Training data shape: {x_train_fine.shape}')
print(f'CIFAR-100 Training labels shape: {y_train_fine.shape}')

CIFAR-100 Training data shape: (50000, 32, 32, 3)
CIFAR-100 Training labels shape: (50000, 1)


## Modify the Pre-trained Model for Fine-tuning

Here we keep the first convolutional block of the pre-trained model and replace everything after it, including the "prediction head", with a new architecture. The most important part to replace is the final dense layer, which now produces a softmax output over 100 classes. If you look back, you will see that the original head did softmax over just 10 classes, so it cannot be reused for CIFAR-100.

In [None]:
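# Take the output of the initial conv block: layers[3] is the ReLU that follows
# the first Conv2D and BatchNormalization. The pre-trained residual blocks and
# dense layers after it are not reused.
x = model.layers[3].output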

# Add in new residual blocks
x = residual_block(x, 30)
x = layers.MaxPooling2D((2, 2))(x)

x = residual_block(x, 52)
x = layers.MaxPooling2D((2, 2))(x)

x = residual_block(x, 64)
x = layers.MaxPooling2D((2, 2))(x)

# Reduce dimensionality
x = layers.GlobalAveragePooling2D()(x)

# New model output - 100 classes
outputs = layers.Dense(100, activation='softmax')(x)

# Create new model with updated layers
model_finetune = Model(inputs=model.input, outputs=outputs)

# Freeze the reused layers from the pre-trained model
# (input, first Conv2D, and its BatchNormalization)
for layer in model_finetune.layers[:3]:
    layer.trainable = False

# Model summary
model_finetune.summary()

## Fine-tune

Now we fine-tune the model on the CIFAR-100 dataset to see how it performs. Note that the first convolutional layer and its batch normalization are frozen, to try to keep some of the low-level features learned during pre-training.
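
The indices used when building and freezing the fine-tuning model are easier to follow with the layer order written out. This is a plain-Python sketch, assuming (as I understand Keras functional models) that the `InputLayer` sits at index 0 of `model.layers`, followed by the layers in the order they were created above:

```python
# Order of the first layers in the pre-trained model, as built above
# (assumes the InputLayer is at index 0 of `model.layers`).
first_layers = ["InputLayer", "Conv2D", "BatchNormalization", "ReLU"]

reused_output = first_layers[3]  # layer whose output feeds the new blocks
frozen = first_layers[:3]        # slice used when setting trainable = False

print(reused_output)  # ReLU
print(frozen)         # ['InputLayer', 'Conv2D', 'BatchNormalization']
```

Only the `Conv2D` and `BatchNormalization` layers in that slice carry weights, so they are the ones freezing actually affects. Everything added after index 3 is new and trained from scratch.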

In [None]:
# Callback to reduce learning rate on plateau in performance
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss', factor=0.5, patience=7, min_lr=1e-6, verbose=1
)

# Early stopping callback
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                                  patience=20,
                                                  restore_best_weights=True)

# Set up initial learning rate
initial_learning_rate = 1e-3

# Set up optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=initial_learning_rate)

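# Define a light data augmentation pipeline
# value_range matches the [0, 1] pixel scaling used above (the default is [0, 255])
data_augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.1),
    layers.RandomBrightness(0.05, value_range=(0.0, 1.0)),
    layers.RandomContrast(0.05),
])

# Apply this augmentation to the training data only
# Note: x_train_augmented is not passed to fit() below, so the run shown
# here trains on the un-augmented images
x_train_augmented = data_augmentation(x_train_fine, training=True)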

# Set up loss function
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)


# Compile model
model_finetune.compile(optimizer=optimizer,
                       loss=loss,
                       metrics=['accuracy'])


# Train the model
history = model_finetune.fit(
    x_train_fine, y_train_fine, batch_size=32,
    epochs=100,
    validation_data=(x_test_fine, y_test_fine),
    callbacks=[early_stopping, reduce_lr]
)

Epoch 1/100
1563/1563 ━━━━━━━━━━━━━━━━━━━━ 35s 15ms/step - accuracy: 0.1261 - loss: 4.1593 - val_accuracy: 0.2328 - val_loss: 3.3571 - learning_rate: 0.0010
Epoch 2/100
1563/1563 ━━━━━━━━━━━━━━━━━━━━ 24s 8ms/step - accuracy: 0.3037 - loss: 2.9786 - val_accuracy: 0.2082 - val_loss: 3.8296 - learning_rate: 0.0010
Epoch 3/100
1563/1563 ━━━━━━━━━━━━━━━━━━━━ 21s 8ms/step - accuracy: 0.3871 - loss: 2.5853 - val_accuracy: 0.3293 - val_loss: 2.9763 - learning_rate: 0.0010
Epoch 4/100
1563/1563 ━━━━━━━━━━━━━━━━━━━━ 12s 8ms/step - accuracy: 0.4284 - loss: 2.4065 - val_accuracy: 0.3688 - val_loss: 2.7879 - learning_rate: 0.0010
Epoch 5/100
1563/1563 ━━━━━━━━━━━━━━━━━━━━ 12s 8ms/step - accuracy: 0.4638 - loss: 2.2632 - val_accuracy: 0.3632 - val_loss: 2.7871 - learning_rate: 0.0010
Epoch 6/100
1563/1563 ━━━━━━━━━━━━━━ ...

In [None]:
# Evaluate fine-tuned model
test_loss, test_acc = model_finetune.evaluate(x_test_fine, y_test_fine,
                                              verbose=0)


print(f'Test accuracy on CIFAR-100: {round(test_acc*100,2)}%\nTest loss on CIFAR-100: {test_loss}')

Test accuracy on CIFAR-100: 55.16%
Test loss on CIFAR-100: 2.140716552734375


## Wrap Up

In this notebook, we explored the process of pre-training a convolutional neural network (CNN) on the CIFAR-10 dataset and then fine-tuning it for classification on the CIFAR-100 dataset using the Keras Functional API. This exercise demonstrates the practical application of transfer learning and highlights the importance of fine-tuning models when adapting them to new but related tasks.

This was a tougher task than most, since our initial model was pre-trained on a simpler problem (10 classes) than the one we attempted to fine-tune it for (100 classes). Most of the time it is the other way around: you pre-train on a large, general dataset and fine-tune on a smaller, narrower one. However, this serves as a great challenge and learning opportunity.

**Future areas you could explore:**
- Hyperparameter Optimization: Explore various hyperparameters such as learning rates, batch sizes, and optimizers to further enhance the model's performance.
- Freezing layers: Typically, when fine-tuning, you will opt to freeze certain layers (usually the lower ones) in the model, so that you don't destroy the foundational knowledge gained during the pre-training stage. In this case, you could experiment with freezing more or fewer layers.
- Data Augmentation Techniques: Implement further data augmentation to increase the diversity of the training data, which can help improve the model's generalization capabilities.
- Application to Other Datasets: Apply the transfer learning and fine-tuning approach to other datasets and domains to assess its versatility and effectiveness in different contexts.
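
For the data augmentation idea above, the key point is that augmentation should be re-sampled for every batch and every epoch, so the model sees a slightly different view of each image each time. As a minimal NumPy sketch of that principle (a hypothetical helper on synthetic data, covering only horizontal flips):

```python
import numpy as np


def random_horizontal_flip(batch, rng, probability=0.5):
    """Flip each image in a (N, H, W, C) batch left-right with the given probability."""
    flip_mask = rng.random(batch.shape[0]) < probability
    flipped = batch[:, :, ::-1, :]
    return np.where(flip_mask[:, None, None, None], flipped, batch), flip_mask


rng = np.random.default_rng(0)
# Synthetic stand-in for normalized CIFAR images.
images = rng.random((8, 32, 32, 3), dtype=np.float32)

# Calling the function once per batch and per epoch yields a different
# augmented view each time.
for epoch in range(2):
    augmented, flip_mask = random_horizontal_flip(images, rng)
    print(f"epoch {epoch}: flipped {int(flip_mask.sum())} of {len(images)} images")
```

In Keras, one way to get this behavior is to place the random preprocessing layers at the start of the model itself, so that they run on each batch during training.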