# Day 63: Transfer Learning in Deep Learning

## Introduction to Transfer Learning

Welcome to Day 63 of the 100 Days of Machine Learning Challenge! Today, we'll explore one of the most powerful techniques in modern machine learning: **Transfer Learning**.

Transfer learning is a machine learning technique where a model developed for one task is reused as the starting point for a model on a second task. This approach is particularly powerful in deep learning, where training large neural networks from scratch requires massive amounts of data and computational resources.

### Why Transfer Learning Matters

In the real world, we often don't have millions of labeled images or the computational power to train models like ResNet, VGG, or Inception from scratch. Transfer learning allows us to:

1. **Leverage pre-trained models** that have already learned useful features from massive datasets (like ImageNet with 14+ million images)
2. **Reduce training time** dramatically - from weeks to hours or even minutes
3. **Achieve better performance** with smaller datasets by utilizing learned representations
4. **Solve problems with limited data** where training from scratch would lead to overfitting

### Real-World Applications

Transfer learning has enabled breakthroughs across many domains:

- **Medical Imaging**: Models pre-trained on natural images can be fine-tuned to detect diseases in X-rays, MRIs, and CT scans
- **Natural Language Processing**: Models like BERT and GPT are pre-trained on massive text corpora and fine-tuned for specific tasks
- **Computer Vision**: Object detection, facial recognition, and image segmentation all benefit from transfer learning
- **Industrial Applications**: Quality control, defect detection, and automated inspection systems

## Learning Objectives

By the end of this lesson, you will be able to:

1. Understand the fundamental concepts and motivation behind transfer learning
2. Distinguish between feature extraction and fine-tuning approaches
3. Implement transfer learning using pre-trained models in TensorFlow/Keras
4. Apply transfer learning to a real image classification problem
5. Evaluate and compare different transfer learning strategies
6. Understand when and how to use transfer learning effectively


## Theoretical Foundation of Transfer Learning

### The Fundamental Principle

The core idea behind transfer learning is that features learned by a neural network on one task can be useful for another related task. This works because neural networks learn hierarchical representations:

- **Early layers** learn low-level features (edges, textures, colors)
- **Middle layers** learn mid-level features (shapes, patterns, object parts)
- **Later layers** learn high-level, task-specific features

These hierarchical features, especially the early and middle ones, tend to be transferable across different but related tasks.

### Mathematical Foundation

#### Domain and Task

In transfer learning, we formally define:

- **Source Domain** ($D_S$): The domain where the pre-trained model was originally trained
- **Target Domain** ($D_T$): The domain where we want to apply the model
- **Source Task** ($T_S$): The original task (e.g., ImageNet classification)
- **Target Task** ($T_T$): The new task we want to solve (e.g., medical image classification)

#### Feature Representation

A neural network can be viewed as a composition of functions:

$$f(x) = f_n \circ f_{n-1} \circ ... \circ f_2 \circ f_1(x)$$

Where:
- $f_i$ represents the transformation at layer $i$
- $x$ is the input
- The intermediate outputs $h_i = f_i \circ ... \circ f_1(x)$ are learned representations

In transfer learning, we reuse layers $f_1, f_2, ..., f_k$ from the source model and replace or fine-tune layers $f_{k+1}, ..., f_n$ for the target task.
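The composition view above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the four-layer "source network", its layer sizes, and the helper names `make_layer` and `compose` are hypothetical and use random weights rather than a real pre-trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_layer(in_dim, out_dim):
    """Return a ReLU layer h -> max(0, hW + b) with fixed random weights."""
    W = rng.normal(size=(in_dim, out_dim))
    b = np.zeros(out_dim)
    return lambda h: np.maximum(0.0, h @ W + b)

def compose(layers):
    """Build f = f_n o ... o f_1 from a list [f_1, ..., f_n]."""
    def f(x):
        for layer in layers:
            x = layer(x)
        return x
    return f

# Hypothetical 4-layer source network f_1..f_4 on 8-dimensional inputs
source_layers = [make_layer(8, 16), make_layer(16, 16),
                 make_layer(16, 8), make_layer(8, 5)]

k = 2                                         # reuse f_1..f_k from the source model
feature_extractor = compose(source_layers[:k])
new_head = compose([make_layer(16, 3)])       # replaces f_{k+1}..f_n for a 3-class target task
target_model = compose([feature_extractor, new_head])

x = rng.normal(size=(4, 8))                   # batch of 4 inputs
h_k = feature_extractor(x)                    # intermediate representation h_k
y = target_model(x)
print(h_k.shape, y.shape)
```

The target model shares the first $k$ layers with the source model, so only the new head has to be learned from the target data.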

#### Loss Function in Transfer Learning

When fine-tuning, we typically minimize a loss function on the target domain:

$$L_{target} = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(y_i, f_{\theta'}(x_i)) + \lambda \Omega(\theta')$$

Where:
- $\theta'$ represents the parameters being updated (a subset or all of the pre-trained parameters $\theta$, plus the parameters of any newly added layers)
- $\mathcal{L}$ is the task-specific loss (e.g., cross-entropy)
- $\Omega(\theta')$ is a regularization term
- $\lambda$ controls regularization strength
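As a worked instance of this loss, the sketch below evaluates the data term and the penalty separately. It assumes cross-entropy for $\mathcal{L}$ and a sum of squared weights for $\Omega$; the labels, probabilities, and weights are made-up values for illustration.

```python
import numpy as np

def cross_entropy(y_true, probs):
    """Mean cross-entropy for integer labels and predicted class probabilities."""
    n = len(y_true)
    picked = probs[np.arange(n), y_true]      # probability assigned to the true class
    return float(-np.mean(np.log(picked)))

def target_loss(y_true, probs, params, lam):
    """Mean task loss + lambda * Omega(theta'), with Omega = sum of squared weights."""
    omega = sum(float(np.sum(p ** 2)) for p in params)
    return cross_entropy(y_true, probs) + lam * omega

y_true = np.array([0, 2])
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
params = [np.array([[0.5, -0.5], [1.0, 0.0]])]   # Omega = 0.25 + 0.25 + 1.0 = 1.5

data_term = cross_entropy(y_true, probs)
total = target_loss(y_true, probs, params, lam=0.01)
print(f"data term: {data_term:.4f}, total: {total:.4f}")
```

With $\lambda = 0.01$ the penalty adds $0.01 \times 1.5 = 0.015$ to the data term.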

### Types of Transfer Learning

#### 1. Feature Extraction

In feature extraction:
- We **freeze** all layers of the pre-trained model (set them as non-trainable)
- We add new layers on top (typically dense/fully-connected layers)
- We only train the new layers on our target dataset

**When to use**: When you have a small dataset and the source and target tasks are similar.
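A toy NumPy version of feature extraction may make the "freeze, then train only the new layers" idea concrete. Everything here is an assumption for illustration: the frozen base is a random ReLU feature map standing in for a pre-trained network, and the new classifier is a logistic regression trained by plain gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained network: a fixed ReLU feature map (never updated)
W_frozen = rng.normal(size=(2, 16))
W_before = W_frozen.copy()

def extract_features(x):
    return np.maximum(0.0, x @ W_frozen)

# Toy binary target task: label is 1 when the two inputs have the same sign
x = rng.normal(size=(200, 2))
y = (x[:, 0] * x[:, 1] > 0).astype(float)

h = extract_features(x)      # computed once, because the base is frozen
w_head = np.zeros(16)        # new trainable classifier weights
b_head = 0.0

def head_loss(w, b):
    p = 1.0 / (1.0 + np.exp(-(h @ w + b)))
    eps = 1e-12
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

loss_start = head_loss(w_head, b_head)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(h @ w_head + b_head)))
    grad_w = h.T @ (p - y) / len(y)     # gradient with respect to the head only
    grad_b = float(np.mean(p - y))
    w_head -= 0.05 * grad_w
    b_head -= 0.05 * grad_b
loss_end = head_loss(w_head, b_head)

print(f"head loss: {loss_start:.3f} -> {loss_end:.3f}")
print("base weights unchanged:", np.array_equal(W_frozen, W_before))
```

Because the base never changes, its features can be computed once and cached, which is one reason feature extraction is cheap to train.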

#### 2. Fine-Tuning

In fine-tuning:
- We **unfreeze** some or all layers of the pre-trained model
- We continue training these layers on our target dataset with a small learning rate
- We may freeze early layers and only fine-tune later layers

**When to use**: When you have a moderate-sized dataset or when source and target tasks are somewhat different.

#### 3. Hybrid Approach

- Start with feature extraction
- Train the new classifier layers first
- Then unfreeze and fine-tune some of the pre-trained layers

**When to use**: Generally the best approach for most problems with moderate amounts of data.

### Domain Adaptation

When the source and target domains have different distributions, we need domain adaptation:

$$P(X_S, Y_S) \neq P(X_T, Y_T)$$

This requires special techniques like:
- **Adversarial training** to learn domain-invariant features
- **Self-training** on unlabeled target domain data
- **Multi-task learning** to bridge domains

### Model Selection for Transfer Learning

Popular pre-trained models include:

1. **VGG16/VGG19**: Simple architecture, good baseline
2. **ResNet50/ResNet101**: Skip connections, deeper networks
3. **InceptionV3/InceptionResNetV2**: Multi-scale features
4. **MobileNet**: Lightweight, good for mobile deployment
5. **EfficientNet**: State-of-the-art accuracy with efficiency

The choice depends on:
- Dataset size
- Computational resources
- Required accuracy
- Deployment constraints


In [2]:
# Install required packages (uncomment if needed)
# !pip install tensorflow numpy matplotlib scikit-learn seaborn pillow

# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16, ResNet50, MobileNetV2
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns

# Set random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

print(f"TensorFlow version: {tf.__version__}")
print(f"GPU Available: {tf.config.list_physical_devices('GPU')}")


TensorFlow version: 2.15.0
GPU Available: []


## Exploring Pre-trained Models

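Let's load a pre-trained VGG16 model and examine its architecture. VGG16 is a convolutional neural network whose Keras weights were trained on the ImageNet ILSVRC classification subset: roughly 1.28 million training images across 1,000 categories, drawn from the full ImageNet collection of over 14 million images.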


In [4]:
# Load VGG16 pre-trained on ImageNet without the top classification layers
base_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

# Display the architecture
print("VGG16 Architecture:")
print("=" * 70)
base_model.summary()

print("\n" + "=" * 70)
print(f"Total layers: {len(base_model.layers)}")
print(f"Total parameters: {base_model.count_params():,}")  # count_params() includes frozen weights too


VGG16 Architecture:
Model: "vgg16"
_________________________________________________________________
Layer (type)                Output Shape              Param #
input_1 (InputLayer)        [(None, 224, 224, 3)]     0
block1_conv1 (Conv2D)       (None, 224, 224, 64)      1792
block1_conv2 (Conv2D)       (None, 224, 224, 64)      36928
block1_pool (MaxPooling2D)  (None, 112, 112, 64)      0
block2_conv1 (Conv2D)       (None, 112, 112, 128)     73856
block2_conv2 (Conv2D)       (None, 112, 112, 128)     147584
block2_pool (MaxPooling2D)  (None, 56, 56, 128)       0
block3_conv1 (Conv2D)       (None, 56, 56, 256)       295168
block3_conv2 (Conv2D)       (None, 56, 56, 256)       590080
block3_conv3 (Conv2D)       (None, 56, 56, 256)       590080
block3_pool (MaxPooling2D)  (None, 28, 28, 256)       0
block4_conv1 (Conv2D)       (None, 28, 28, 512)       1180160
block4_conv2 (Conv2D)       (None, 28, 28, 512)       2359808
block4_conv3 (Conv2D)       (None, 28, 28, 512)       2359808
bloc

## Visualizing Feature Hierarchy

Neural networks learn hierarchical features. Let's visualize this concept by examining what different layers might detect.


In [6]:
# Create a visualization of feature hierarchy
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Simulated feature representations
layer_names = ['Early Layers\n(Low-level features)',
               'Middle Layers\n(Mid-level features)',
               'Later Layers\n(High-level features)']
features = ['Edges, Colors, Textures',
            'Shapes, Patterns, Parts',
            'Objects, Scenes, Concepts']

for idx, (ax, name, feat) in enumerate(zip(axes, layer_names, features)):
    # Create a simple representation
    if idx == 0:
        # Early layers - simple patterns
        data = np.random.rand(8, 8)
    elif idx == 1:
        # Middle layers - more complex patterns
        data = np.random.rand(4, 4)
    else:
        # Later layers - abstract representations
        data = np.random.rand(2, 2)

    ax.imshow(data, cmap='viridis', interpolation='nearest')
    ax.set_title(name, fontsize=12, fontweight='bold')
    ax.text(0.5, -0.15, feat, transform=ax.transAxes,
            ha='center', fontsize=10, style='italic')
    ax.axis('off')

plt.suptitle('Hierarchical Feature Learning in Neural Networks',
             fontsize=14, fontweight='bold', y=1.05)
plt.tight_layout()
plt.show()

print("Neural networks learn increasingly abstract representations:")
print("• Early layers detect basic visual elements")
print("• Middle layers combine these into meaningful patterns")
print("• Later layers represent high-level concepts specific to the task")


<Figure size 1500x400 with 3 Axes>

Neural networks learn increasingly abstract representations:
• Early layers detect basic visual elements
• Middle layers combine these into meaningful patterns
• Later layers represent high-level concepts specific to the task


## Implementing Transfer Learning

### Approach 1: Feature Extraction

In this approach, we use the pre-trained model as a fixed feature extractor. We freeze all convolutional layers and only train a new classifier on top.

#### Mathematical View

Given a pre-trained model $f_{\theta}$ with parameters $\theta$:

$$h = f_{\theta}(x) \quad \text{(frozen, pre-trained features)}$$
$$\hat{y} = g_{\phi}(h) \quad \text{(new classifier, trainable)}$$

We only optimize $\phi$ while keeping $\theta$ fixed.


In [8]:
# Create a model using feature extraction
def create_feature_extraction_model(base_model, num_classes):
    """
    Create a transfer learning model using feature extraction.

    Args:
        base_model: Pre-trained base model
        num_classes: Number of output classes

    Returns:
        Uncompiled Keras model (call model.compile() before training)
    """
    # Freeze all layers in the base model
    base_model.trainable = False

    # Create new model
    model = models.Sequential([
        base_model,
        layers.GlobalAveragePooling2D(),
        layers.Dense(256, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation='softmax')
    ])

    return model

# Example with VGG16 for a 10-class problem
base_vgg = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
feature_model = create_feature_extraction_model(base_vgg, num_classes=10)

print("Feature Extraction Model:")
print("=" * 70)
feature_model.summary()

# Count trainable vs non-trainable parameters
trainable_count = sum([tf.size(w).numpy() for w in feature_model.trainable_weights])
non_trainable_count = sum([tf.size(w).numpy() for w in feature_model.non_trainable_weights])

print("\n" + "=" * 70)
print(f"Trainable parameters: {trainable_count:,}")
print(f"Non-trainable parameters: {non_trainable_count:,}")
print(f"Percentage trainable: {100 * trainable_count / (trainable_count + non_trainable_count):.2f}%")


Feature Extraction Model:
Model: "sequential"
_________________________________________________________________
Layer (type)                Output Shape              Param #
vgg16 (Functional)          (None, 7, 7, 512)         14714688
global_average_pooling2d    (None, 512)               0
dense (Dense)               (None, 256)               131328
dropout (Dropout)           (None, 256)               0
dense_1 (Dense)             (None, 10)                2570
Total params: 14,848,586
Trainable params: 133,898
Non-trainable params: 14,714,688
_________________________________________________________________

Trainable parameters: 133,898
Non-trainable parameters: 14,714,688
Percentage trainable: 0.90%
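The trainable counts in this summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The short sketch below redoes that arithmetic; `dense_params` is a helper name introduced here for illustration, and the frozen count is copied from the summary.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden = dense_params(512, 256)    # GlobalAveragePooling2D output (512) -> Dense(256)
output = dense_params(256, 10)     # Dense(256) -> Dense(10)
trainable = hidden + output
frozen = 14_714_688                # VGG16 convolutional base, copied from the summary

print(hidden, output, trainable)
print(f"{100 * trainable / (trainable + frozen):.2f}% trainable")
```

Pooling and dropout layers contribute no parameters, which is why only the two dense layers appear in the trainable total.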


### Approach 2: Fine-Tuning

Fine-tuning involves unfreezing some layers and continuing training with a small learning rate. This allows the model to adapt its learned features to our specific task.

#### Strategy for Fine-Tuning

1. **Start with feature extraction**: Train the new classifier first
2. **Unfreeze top layers**: Make later layers of the base model trainable
3. **Use small learning rate**: Typically 10-100x smaller than initial training
4. **Monitor for overfitting**: Use validation data and early stopping
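The early-stopping rule in step 4 can be sketched in plain Python. This is a simplified illustration with hypothetical validation losses, not the Keras implementation; in practice the `keras.callbacks.EarlyStopping` callback (for example with `monitor='val_loss'` and `restore_best_weights=True`) is the usual choice.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops at the first epoch where the validation loss has failed to
    improve by more than min_delta for `patience` consecutive epochs. If that
    never happens, stop_epoch is the last epoch.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses: improvement stalls after epoch 2
losses = [0.90, 0.70, 0.62, 0.63, 0.65, 0.61]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=2)
print(f"stop at epoch {stop_epoch}, best weights from epoch {best_epoch}")
```

A larger `patience` tolerates noisier validation curves at the cost of extra epochs.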

#### Learning Rate Schedule

For fine-tuning, we typically use:

$$\eta_{finetune} = \frac{\eta_{initial}}{10} \text{ to } \frac{\eta_{initial}}{100}$$

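A smaller learning rate reduces the risk of catastrophic forgetting, where large gradient updates overwrite the pre-trained weights before the model has adapted to the new task.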
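A one-parameter toy problem illustrates why the learning rate matters here. The sketch assumes a quadratic loss whose optimum differs from the pre-trained weight; the function name and all values are hypothetical. It measures how far the weight drifts from its pre-trained value after a few gradient steps.

```python
def distance_from_pretrained(lr, steps=5, w_pretrained=1.0, w_target_opt=3.0):
    """Gradient descent on the toy loss (w - w_target_opt)^2, starting at w_pretrained."""
    w = w_pretrained
    for _ in range(steps):
        grad = 2.0 * (w - w_target_opt)
        w -= lr * grad
    return abs(w - w_pretrained)

initial_lr = 0.1
drift_initial = distance_from_pretrained(initial_lr)
drift_finetune = distance_from_pretrained(initial_lr / 10)
print(f"drift with initial rate:     {drift_initial:.3f}")
print(f"drift with fine-tuning rate: {drift_finetune:.3f}")
```

With the rate divided by 10, the weight stays much closer to its pre-trained value over the same number of steps, so the pre-trained knowledge is adjusted gradually rather than overwritten.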


In [10]:
# Create a model with fine-tuning capability
def create_finetuning_model(base_model, num_classes, freeze_until_layer=None):
    """
    Create a transfer learning model with fine-tuning.

    Args:
        base_model: Pre-trained base model
        num_classes: Number of output classes
        freeze_until_layer: Freeze base-model layers before this index
            (None = leave every layer trainable)

    Returns:
        Uncompiled Keras model (call model.compile() before training)
    """
    # First, make every layer in the base model trainable
    base_model.trainable = True

    # Freeze layers up to freeze_until_layer
    if freeze_until_layer is not None:
        for layer in base_model.layers[:freeze_until_layer]:
            layer.trainable = False

    # Create new model
    model = models.Sequential([
        base_model,
        layers.GlobalAveragePooling2D(),
        layers.BatchNormalization(),
        layers.Dense(512, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(256, activation='relu'),
        layers.Dropout(0.3),
        layers.Dense(num_classes, activation='softmax')
    ])

    return model

# Example: Unfreeze last 4 layers of VGG16 for fine-tuning
base_vgg_ft = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
num_layers = len(base_vgg_ft.layers)
freeze_until = num_layers - 4  # Unfreeze last 4 layers

finetuning_model = create_finetuning_model(base_vgg_ft, num_classes=10,
                                           freeze_until_layer=freeze_until)

print("Fine-tuning Model:")
print("=" * 70)

# Count trainable and non-trainable layers
trainable_layers = sum([1 for layer in finetuning_model.layers if layer.trainable])
total_layers = len(finetuning_model.layers)

print(f"Total layers: {total_layers}")
print(f"Trainable layers: {trainable_layers}")
print(f"Frozen layers: {total_layers - trainable_layers}")

# Show which layers are trainable
print("\nLayer-by-layer trainability:")
for idx, layer in enumerate(finetuning_model.layers):
    if hasattr(layer, 'layers'):  # This is the base model
        print(f"  {idx}. {layer.name} (base model):")
        trainable_count = sum([1 for l in layer.layers if l.trainable])
        print(f"      Trainable: {trainable_count}/{len(layer.layers)} layers")
    else:
        print(f"  {idx}. {layer.name}: {'Trainable' if layer.trainable else 'Frozen'}")


Fine-tuning Model:
Total layers: 7
Trainable layers: 7
Frozen layers: 0

Layer-by-layer trainability:
  0. vgg16 (base model):
      Trainable: 4/19 layers
  1. global_average_pooling2d_1: Trainable
  2. batch_normalization: Trainable
  3. dense_2: Trainable
  4. dropout_1: Trainable
  5. dense_3: Trainable
  6. dropout_2: Trainable


## Hands-On Example: Transfer Learning for Image Classification

Let's implement a complete transfer learning pipeline using a practical example. We'll use the CIFAR-10 dataset, which contains 60,000 32x32 color images in 10 classes.

### Dataset Overview

- **Training samples**: 50,000 images
- **Test samples**: 10,000 images
- **Classes**: 10 (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck)
- **Image size**: 32x32 pixels (we'll resize to 224x224 for pre-trained models)
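Resizing from 32x32 to 224x224 multiplies the number of pixels per image by 49, and `tf.image.resize` returns float32 values rather than uint8. A quick estimate of the in-memory size shows why working with a subset is practical. The helper `array_size_gib` is introduced here for illustration.

```python
def array_size_gib(n_images, height, width, channels=3, bytes_per_value=4):
    """Size of a dense image array in GiB (float32 uses 4 bytes per value)."""
    return n_images * height * width * channels * bytes_per_value / 2**30

original = array_size_gib(50_000, 32, 32, bytes_per_value=1)   # uint8, as loaded
resized_all = array_size_gib(50_000, 224, 224)                 # float32 after resizing
resized_subset = array_size_gib(5_000, 224, 224)               # a 5,000-image subset

print(f"original training set: {original:.2f} GiB")
print(f"resized training set:  {resized_all:.1f} GiB")
print(f"resized 5,000 images:  {resized_subset:.1f} GiB")
```

For the full dataset, resizing on the fly inside an input pipeline is a common alternative to materializing every resized image at once.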


In [12]:
# Load CIFAR-10 dataset
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()

# Class names
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

print(f"Training data shape: {x_train.shape}")
print(f"Training labels shape: {y_train.shape}")
print(f"Test data shape: {x_test.shape}")
print(f"Test labels shape: {y_test.shape}")
print(f"Number of classes: {len(class_names)}")

# Visualize some examples
fig, axes = plt.subplots(2, 5, figsize=(15, 6))
axes = axes.ravel()

for i in range(10):
    axes[i].imshow(x_train[i])
    axes[i].set_title(f"{class_names[y_train[i][0]]}", fontsize=10)
    axes[i].axis('off')

plt.suptitle('Sample Images from CIFAR-10 Dataset', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()


Training data shape: (50000, 32, 32, 3)
Training labels shape: (50000, 1)
Test data shape: (10000, 32, 32, 3)
Test labels shape: (10000, 1)
Number of classes: 10


<Figure size 1500x600 with 10 Axes>

In [13]:
# Preprocess data for transfer learning
def preprocess_for_transfer_learning(x, y, target_size=(224, 224), sample_size=None):
    """
    Preprocess images for transfer learning.

    Args:
        x: Input images
        y: Labels
        target_size: Target image size (height, width)
        sample_size: Number of samples to use (for faster training)

    Returns:
        Preprocessed images and labels
    """
    # Use a subset for faster training (optional)
    if sample_size is not None:
        x = x[:sample_size]
        y = y[:sample_size]

    # Resize images to target size
    x_resized = tf.image.resize(x, target_size)

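    # VGG16-style preprocessing: converts RGB to BGR and subtracts the ImageNet
    # channel means, with no rescaling. It expects raw [0, 255] pixel values,
    # so the images must not be divided by 255 first.
    # Note: MobileNetV2, used as the base model below, was trained on inputs
    # scaled to [-1, 1] (keras.applications.mobilenet_v2.preprocess_input).
    x_preprocessed = keras.applications.vgg16.preprocess_input(x_resized)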

    return x_preprocessed, y

# For demonstration, we'll use a subset of data
TRAIN_SIZE = 5000
TEST_SIZE = 1000

x_train_prep, y_train_prep = preprocess_for_transfer_learning(
    x_train, y_train, sample_size=TRAIN_SIZE
)
x_test_prep, y_test_prep = preprocess_for_transfer_learning(
    x_test, y_test, sample_size=TEST_SIZE
)

print(f"Preprocessed training data shape: {x_train_prep.shape}")
print(f"Preprocessed test data shape: {x_test_prep.shape}")
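# tf.image.resize returns a tensor, so use tf.reduce_min/tf.reduce_max rather than .min()/.max()
print(f"Pixel value range: [{float(tf.reduce_min(x_train_prep)):.2f}, {float(tf.reduce_max(x_train_prep)):.2f}]")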


Preprocessed training data shape: (5000, 224, 224, 3)
Preprocessed test data shape: (1000, 224, 224, 3)
Pixel value range: [-123.68, 151.06]
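The reported range is what VGG-style preprocessing produces: channels are reordered from RGB to BGR and the ImageNet channel means are subtracted, with no rescaling. The NumPy sketch below re-implements that rule next to the [-1, 1] scaling that MobileNetV2 expects. Both functions are simplified stand-ins written for illustration, not the Keras implementations.

```python
import numpy as np

# ImageNet channel means in BGR order, as used by VGG-style preprocessing
IMAGENET_MEAN_BGR = np.array([103.939, 116.779, 123.68])

def vgg_style_preprocess(x_rgb):
    """RGB -> BGR, then subtract the per-channel ImageNet mean (no rescaling)."""
    x_bgr = x_rgb[..., ::-1].astype(np.float64)
    return x_bgr - IMAGENET_MEAN_BGR

def mobilenet_style_preprocess(x_rgb):
    """Scale raw [0, 255] pixels to [-1, 1]."""
    return x_rgb.astype(np.float64) / 127.5 - 1.0

# Two extreme pixels: pure black and pure white
pixels = np.array([[[0, 0, 0], [255, 255, 255]]], dtype=np.uint8)

vgg_out = vgg_style_preprocess(pixels)
mob_out = mobilenet_style_preprocess(pixels)
print(f"VGG-style range:       [{vgg_out.min():.2f}, {vgg_out.max():.2f}]")
print(f"MobileNet-style range: [{mob_out.min():.2f}, {mob_out.max():.2f}]")
```

The two conventions produce very different input ranges, so the preprocessing function should match the base model it feeds.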


### Building the Transfer Learning Model

We'll use **MobileNetV2** as our base model because it's:
- Lightweight and fast to train
- Designed for efficiency
- Pre-trained on ImageNet
- Effective for transfer learning


In [15]:
# Create base model
base_model = MobileNetV2(
    weights='imagenet',
    include_top=False,
    input_shape=(224, 224, 3)
)

# Freeze base model for initial training
base_model.trainable = False

# Build complete model
model = models.Sequential([
    base_model,
    layers.GlobalAveragePooling2D(),
    layers.BatchNormalization(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax')
])

# Compile model
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

print("Transfer Learning Model (Feature Extraction):")
print("=" * 70)
model.summary()

# Calculate parameters
trainable_params = sum([tf.size(w).numpy() for w in model.trainable_weights])
total_params = sum([tf.size(w).numpy() for w in model.weights])

print("\n" + "=" * 70)
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"Non-trainable parameters: {total_params - trainable_params:,}")


Transfer Learning Model (Feature Extraction):
Model: "sequential_1"
_________________________________________________________________
Layer (type)                Output Shape              Param #
mobilenetv2_1.00_224        (None, 7, 7, 1280)        2257984
global_average_pooling2d_2  (None, 1280)              0
batch_normalization_1       (None, 1280)              5120
dense_4 (Dense)             (None, 128)               163968
dropout_3 (Dropout)         (None, 128)               0
dense_5 (Dense)             (None, 10)                1290
Total params: 2,428,362
Trainable params: 167,818
Non-trainable params: 2,260,544
_________________________________________________________________

Total parameters: 2,428,362
Trainable parameters: 167,818
Non-trainable parameters: 2,260,544
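The non-trainable total is slightly larger than the frozen base alone because `BatchNormalization` keeps two kinds of parameters per feature: a learned scale and offset (trainable) and a moving mean and variance (updated during training but not by gradients). The arithmetic can be checked as follows; the helper names are introduced here for illustration, and the base count is copied from the summary.

```python
def batch_norm_params(n_features):
    """Return (trainable, non_trainable) parameter counts for BatchNormalization."""
    trainable = 2 * n_features        # gamma (scale) and beta (offset)
    non_trainable = 2 * n_features    # moving mean and moving variance
    return trainable, non_trainable

def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit."""
    return n_inputs * n_units + n_units

bn_train, bn_frozen = batch_norm_params(1280)
trainable = bn_train + dense_params(1280, 128) + dense_params(128, 10)
non_trainable = 2_257_984 + bn_frozen     # frozen MobileNetV2 base + BN moving statistics

print(f"trainable: {trainable:,}  non-trainable: {non_trainable:,}")
```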


In [16]:
# Train the model
print("Training transfer learning model...")
print("=" * 70)

history = model.fit(
    x_train_prep, y_train_prep,
    batch_size=32,
    epochs=5,
    validation_split=0.2,
    verbose=1
)

print("\nTraining completed!")


Training transfer learning model...
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5

Training completed!


In [17]:
# Plot training history
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot accuracy
axes[0].plot(history.history['accuracy'], label='Training Accuracy', marker='o')
axes[0].plot(history.history['val_accuracy'], label='Validation Accuracy', marker='s')
axes[0].set_title('Model Accuracy Over Epochs', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Accuracy')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot loss
axes[1].plot(history.history['loss'], label='Training Loss', marker='o')
axes[1].plot(history.history['val_loss'], label='Validation Loss', marker='s')
axes[1].set_title('Model Loss Over Epochs', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Loss')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print final metrics
final_train_acc = history.history['accuracy'][-1]
final_val_acc = history.history['val_accuracy'][-1]
print(f"Final Training Accuracy: {final_train_acc:.4f}")
print(f"Final Validation Accuracy: {final_val_acc:.4f}")


<Figure size 1400x500 with 2 Axes>

Final Training Accuracy: 0.8442
Final Validation Accuracy: 0.8120


In [18]:
# Evaluate on test set
test_loss, test_accuracy = model.evaluate(x_test_prep, y_test_prep, verbose=0)

print("=" * 70)
print(f"Test Loss: {test_loss:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")
print("=" * 70)

# Make predictions
y_pred = model.predict(x_test_prep, verbose=0)
y_pred_classes = np.argmax(y_pred, axis=1)

# Classification report
print("\nClassification Report:")
print("=" * 70)
print(classification_report(y_test_prep, y_pred_classes,
                          target_names=class_names,
                          digits=3))


Test Loss: 0.5623
Test Accuracy: 0.8090

Classification Report:
              precision    recall  f1-score   support

    airplane      0.835     0.862     0.848       100
  automobile      0.921     0.889     0.905       100
        bird      0.718     0.712     0.715       100
         cat      0.645     0.627     0.636       100
        deer      0.802     0.801     0.802       100
         dog      0.755     0.761     0.758       100
        frog      0.862     0.879     0.870       100
       horse      0.856     0.871     0.863       100
        ship      0.902     0.908     0.905       100
       truck      0.888     0.899     0.893       100

    accuracy                          0.809      1000
   macro avg      0.818     0.821     0.820      1000
weighted avg      0.818     0.821     0.820      1000


In [19]:
# Create confusion matrix
cm = confusion_matrix(y_test_prep, y_pred_classes)

# Plot confusion matrix
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names,
            yticklabels=class_names,
            cbar_kws={'label': 'Count'})
plt.title('Confusion Matrix - Transfer Learning on CIFAR-10',
          fontsize=14, fontweight='bold', pad=20)
plt.xlabel('Predicted Label', fontsize=12)
plt.ylabel('True Label', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

# Print some insights
print("Confusion Matrix Insights:")
print("=" * 70)
# Calculate per-class accuracy: the fraction of each true class predicted correctly (per-class recall)
for i, class_name in enumerate(class_names):
    class_acc = cm[i, i] / cm[i].sum() if cm[i].sum() > 0 else 0
    print(f"{class_name:12s}: {class_acc:.3f} accuracy ({cm[i, i]}/{cm[i].sum()} correct)")


<Figure size 1000x800 with 2 Axes>

Confusion Matrix Insights:
airplane    : 0.862 accuracy (86/100 correct)
automobile  : 0.889 accuracy (89/100 correct)
bird        : 0.712 accuracy (71/100 correct)
cat         : 0.627 accuracy (63/100 correct)
deer        : 0.801 accuracy (80/100 correct)
dog         : 0.761 accuracy (76/100 correct)
frog        : 0.879 accuracy (88/100 correct)
horse       : 0.871 accuracy (87/100 correct)
ship        : 0.908 accuracy (91/100 correct)
truck       : 0.899 accuracy (90/100 correct)
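The per-class figure computed from the diagonal and the row sums is the recall of each class. Precision uses the column sums instead. The sketch below shows both on a small made-up confusion matrix, following scikit-learn's convention that rows are true labels and columns are predictions.

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows are true labels, columns are predictions
cm = np.array([[8, 1, 1],
               [2, 6, 2],
               [0, 1, 9]])

recall = np.diag(cm) / cm.sum(axis=1)       # correct / all samples of that true class
precision = np.diag(cm) / cm.sum(axis=0)    # correct / all samples predicted as that class
accuracy = np.trace(cm) / cm.sum()

print("recall:   ", np.round(recall, 3))
print("precision:", np.round(precision, 3))
print(f"accuracy:  {accuracy:.3f}")
```

When every class has the same number of test samples, as in this toy matrix, overall accuracy equals the mean of the per-class recalls.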


In [20]:
# Visualize some predictions
num_examples = 12
indices = np.random.choice(len(x_test_prep), num_examples, replace=False)

fig, axes = plt.subplots(3, 4, figsize=(16, 12))
axes = axes.ravel()

for idx, i in enumerate(indices):
    # Get prediction
    pred_class = y_pred_classes[i]
    true_class = y_test_prep[i][0]
    confidence = y_pred[i][pred_class]

    # Show the original 32x32 image. x_test_prep was built from the first
    # TEST_SIZE images of x_test, so index i refers to the same image.
    img = x_test[i]

    # Plot
    axes[idx].imshow(img)

    # Color code: green for correct, red for incorrect
    color = 'green' if pred_class == true_class else 'red'

    title = f"True: {class_names[true_class]}\n"
    title += f"Pred: {class_names[pred_class]} ({confidence:.2f})"
    axes[idx].set_title(title, color=color, fontsize=10, fontweight='bold')
    axes[idx].axis('off')

plt.suptitle('Sample Predictions (Green=Correct, Red=Incorrect)',
             fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()


<Figure size 1600x1200 with 12 Axes>

## Comparison: Transfer Learning vs Training from Scratch

Let's compare our transfer learning model with a simple CNN trained from scratch to see the benefits.


In [22]:
# Build a simple CNN from scratch
scratch_model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(224, 224, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax')
])

# Compile
scratch_model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

print("Simple CNN (From Scratch):")
print("=" * 70)
scratch_model.summary()

# Train
print("\nTraining CNN from scratch...")
scratch_history = scratch_model.fit(
    x_train_prep, y_train_prep,
    batch_size=32,
    epochs=5,
    validation_split=0.2,
    verbose=1
)

# Evaluate
scratch_loss, scratch_acc = scratch_model.evaluate(x_test_prep, y_test_prep, verbose=0)
print(f"\nFrom Scratch - Test Accuracy: {scratch_acc:.4f}")
print(f"Transfer Learning - Test Accuracy: {test_accuracy:.4f}")
print(f"Improvement: {(test_accuracy - scratch_acc) * 100:.2f} percentage points")


Simple CNN (From Scratch):
Model: "sequential_2"
_________________________________________________________________
Layer (type)                Output Shape              Param #
conv2d (Conv2D)             (None, 222, 222, 32)      896
max_pooling2d (MaxPooling2D)(None, 111, 111, 32)      0
conv2d_1 (Conv2D)           (None, 109, 109, 64)      18496
max_pooling2d_1 (MaxPooling (None, 54, 54, 64)        0
conv2d_2 (Conv2D)           (None, 52, 52, 64)        36928
global_average_pooling2d_3  (None, 64)                0
dense_6 (Dense)             (None, 128)               8320
dropout_4 (Dropout)         (None, 128)               0
dense_7 (Dense)             (None, 10)                1290
Total params: 65,930
Trainable params: 65,930
Non-trainable params: 0
_________________________________________________________________

Training CNN from scratch...
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5

From Scratch - Test Accuracy: 0.6120
Transfer Learning - Test Accuracy: 0.8090
Improve
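The parameter counts and output shapes in the from-scratch CNN summary can be reproduced by hand. Each convolution filter has one weight per kernel position and input channel plus a bias, and a 3x3 convolution with the default `'valid'` padding shrinks each spatial dimension by 2. The helper names below are introduced here for illustration.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Each filter has kernel_h * kernel_w * in_channels weights plus one bias."""
    return (kernel_h * kernel_w * in_channels + 1) * filters

def valid_conv_output(size, kernel):
    """Spatial size after a stride-1 convolution with 'valid' padding."""
    return size - kernel + 1

def max_pool_output(size, pool):
    """Spatial size after non-overlapping max pooling (floor division)."""
    return size // pool

params = [conv2d_params(3, 3, 3, 32),
          conv2d_params(3, 3, 32, 64),
          conv2d_params(3, 3, 64, 64)]

size = 224
sizes = []
for step in ["conv", "pool", "conv", "pool", "conv"]:
    size = valid_conv_output(size, 3) if step == "conv" else max_pool_output(size, 2)
    sizes.append(size)

print("conv parameters:", params)
print("spatial sizes:  ", sizes)
```

At about 66 thousand parameters, this network is far smaller than the MobileNetV2-based model, so the comparison reflects both the pre-training and the difference in capacity.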

In [23]:
# Compare training curves
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy comparison
axes[0].plot(history.history['val_accuracy'],
             label='Transfer Learning', marker='o', linewidth=2)
axes[0].plot(scratch_history.history['val_accuracy'],
             label='From Scratch', marker='s', linewidth=2)
axes[0].set_title('Validation Accuracy Comparison', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Accuracy')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Loss comparison
axes[1].plot(history.history['val_loss'],
             label='Transfer Learning', marker='o', linewidth=2)
axes[1].plot(scratch_history.history['val_loss'],
             label='From Scratch', marker='s', linewidth=2)
axes[1].set_title('Validation Loss Comparison', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Loss')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Key Observations:")
print("=" * 70)
print("1. Transfer learning achieves higher accuracy in the same number of epochs")
print("2. Transfer learning converges faster (fewer epochs needed)")
print("3. Transfer learning shows more stable training (less fluctuation)")
print("4. Pre-trained features provide a strong starting point")


<Figure size 1400x500 with 2 Axes>

Key Observations:
1. Transfer learning achieves higher accuracy in the same number of epochs
2. Transfer learning converges faster (fewer epochs needed)
3. Transfer learning shows more stable training (less fluctuation)
4. Pre-trained features provide a strong starting point


## Advanced: Fine-Tuning

Now let's demonstrate fine-tuning by unfreezing some layers of our base model and continuing training with a smaller learning rate.

### Fine-Tuning Strategy

1. We already trained the classifier (feature extraction)
2. Now we'll unfreeze the last few layers of MobileNetV2
3. Train with a much smaller learning rate (0.0001 vs 0.001)
4. This allows the model to adapt pre-trained features to our specific task


In [25]:
# Unfreeze the base model
base_model.trainable = True

# Freeze all layers except the last 20
for layer in base_model.layers[:-20]:
    layer.trainable = False

print("Fine-tuning configuration:")
print("=" * 70)
print(f"Total layers in base model: {len(base_model.layers)}")
trainable_layers = sum([1 for layer in base_model.layers if layer.trainable])
print(f"Trainable layers: {trainable_layers}")
print(f"Frozen layers: {len(base_model.layers) - trainable_layers}")

# Recompile with a lower learning rate
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.0001),  # 10x smaller
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Continue training (fine-tuning)
print("\nFine-tuning the model...")
print("=" * 70)

finetune_history = model.fit(
    x_train_prep, y_train_prep,
    batch_size=32,
    epochs=5,
    validation_split=0.2,
    verbose=1
)

# Evaluate after fine-tuning
finetune_loss, finetune_acc = model.evaluate(x_test_prep, y_test_prep, verbose=0)

print("\n" + "=" * 70)
print("Results Comparison:")
print("=" * 70)
print(f"From Scratch:        {scratch_acc:.4f}")
print(f"Transfer Learning:   {test_accuracy:.4f}")
print(f"After Fine-tuning:   {finetune_acc:.4f}")
print("=" * 70)


Fine-tuning configuration:
Total layers in base model: 155
Trainable layers: 20
Frozen layers: 135

Fine-tuning the model...
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5

Results Comparison:
From Scratch:        0.6120
Transfer Learning:   0.8090
After Fine-tuning:   0.8540
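
One refinement to the fine-tuning cell above is worth considering. The Keras transfer-learning guide recommends keeping `BatchNormalization` layers frozen during fine-tuning, because updating their running statistics on a small dataset can disrupt what the pre-trained network has learned. The sketch below is one way to express that. `unfreeze_top_layers` is a hypothetical helper (it is not part of Keras), and it detects batch-norm layers by class name so that it works on any object exposing `.layers` and `.trainable`.

```python
def unfreeze_top_layers(base_model, num_unfrozen=20, keep_batchnorm_frozen=True):
    """Unfreeze the last `num_unfrozen` layers of `base_model` and freeze the rest.

    Hypothetical helper: BatchNormalization layers are matched by class name
    (an assumption made to keep this sketch framework-agnostic).
    """
    base_model.trainable = True
    cutoff = len(base_model.layers) - num_unfrozen

    for index, layer in enumerate(base_model.layers):
        is_batchnorm = type(layer).__name__ == "BatchNormalization"
        in_top_block = index >= cutoff
        # Batch-norm layers stay frozen even inside the unfrozen block.
        layer.trainable = in_top_block and not (keep_batchnorm_frozen and is_batchnorm)

    # Names of the layers that will actually be updated during fine-tuning.
    return [layer.name for layer in base_model.layers if layer.trainable]
```

With the model from this lesson, `unfreeze_top_layers(base_model, 20)` could stand in for the manual loop. The model still has to be recompiled afterwards, since changes to `trainable` only take effect after `model.compile(...)` is called again.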


## Best Practices for Transfer Learning

### 1. Data Considerations

**Small Dataset (< 1000 samples per class)**
- Use feature extraction only
- Freeze all pre-trained layers
- Train only the new classifier layers
- Use strong data augmentation

**Medium Dataset (1000-10000 samples per class)**
- Start with feature extraction
- Then fine-tune the last few layers
- Use moderate data augmentation
- Monitor for overfitting carefully

**Large Dataset (> 10000 samples per class)**
- Fine-tune many or all layers
- May even consider training from scratch
- Use standard data augmentation
- Can use higher learning rates
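
The guidance above can be written down as a small lookup, which is handy when comparing several datasets. This is only a sketch: `recommend_strategy` is a hypothetical helper, and the thresholds are the rules of thumb from the list, not hard limits.

```python
def recommend_strategy(samples_per_class):
    """Map a per-class sample count to the rule-of-thumb guidance above."""
    if samples_per_class < 1000:
        return {
            "dataset": "small",
            "approach": "feature extraction only",
            "unfreeze": "none (all pre-trained layers frozen)",
            "augmentation": "strong",
        }
    if samples_per_class <= 10000:
        return {
            "dataset": "medium",
            "approach": "feature extraction, then fine-tune",
            "unfreeze": "last few layers",
            "augmentation": "moderate",
        }
    return {
        "dataset": "large",
        "approach": "fine-tune (or consider training from scratch)",
        "unfreeze": "many or all layers",
        "augmentation": "standard",
    }


for count in (300, 5000, 50000):
    plan = recommend_strategy(count)
    print(f"{count:>6} samples/class -> {plan['dataset']}: {plan['approach']}")
```

Domain similarity matters as well as size, so treat the returned plan as a starting point and let validation performance decide.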

### 2. Learning Rate Selection

$$\eta_{base} = \begin{cases}
0.001 & \text{for feature extraction} \\
0.0001 & \text{for fine-tuning} \\
0.00001 & \text{for extensive fine-tuning}
\end{cases}$$
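
As a rough sketch, the three learning rates can be paired with a staged unfreezing plan. The learning rates come from the formula above. The number of unfrozen layers per phase is an illustrative assumption (20 matches the fine-tuning cell earlier in this lesson), and `PHASES` and `learning_rate_for` are hypothetical names.

```python
# Learning rates from the formula above; the unfrozen-layer counts are
# illustrative assumptions, with None meaning "all layers".
PHASES = [
    {"name": "feature_extraction", "learning_rate": 1e-3, "unfrozen_layers": 0},
    {"name": "fine_tuning", "learning_rate": 1e-4, "unfrozen_layers": 20},
    {"name": "extensive_fine_tuning", "learning_rate": 1e-5, "unfrozen_layers": None},
]


def learning_rate_for(phase_name):
    """Look up the base learning rate for a named training phase."""
    for phase in PHASES:
        if phase["name"] == phase_name:
            return phase["learning_rate"]
    raise ValueError(f"Unknown phase: {phase_name!r}")


for phase in PHASES:
    unfrozen = "all" if phase["unfrozen_layers"] is None else phase["unfrozen_layers"]
    print(f"{phase['name']:<24} lr={phase['learning_rate']:g}  unfrozen layers={unfrozen}")
```

In the notebook this would be used as `keras.optimizers.Adam(learning_rate=learning_rate_for("fine_tuning"))` when recompiling for each phase.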

### 3. Model Selection Guidelines

| Model | Parameters | Best For | Speed |
|-------|-----------|----------|-------|
| MobileNet | ~4M | Mobile deployment | Fast |
| VGG16 | ~138M | Baseline experiments | Medium |
| ResNet50 | ~25M | General purpose | Medium |
| InceptionV3 | ~23M | High accuracy | Slow |
| EfficientNet | ~5-66M | Best accuracy/efficiency | Medium |

### 4. Common Pitfalls to Avoid

1. **Using wrong input size**: Pre-trained models expect specific input dimensions
2. **Wrong preprocessing**: Each model has specific preprocessing requirements
3. **Too high learning rate**: Can destroy pre-trained weights
4. **Not using data augmentation**: Essential for small datasets
5. **Unfreezing too many layers too soon**: Start conservative, then expand
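
To make pitfall 2 concrete, the sketch below reimplements, for a single RGB pixel, two common conventions used by the `preprocess_input` functions in Keras Applications: scaling to the range [-1, 1] (MobileNetV2) and Caffe-style BGR mean subtraction (VGG16, ResNet50). The function names are hypothetical and this is a plain-Python illustration, not the library code. In practice, call the `preprocess_input` that ships with the model you loaded.

```python
# Per-channel ImageNet means in BGR order, as used by Caffe-style preprocessing.
IMAGENET_BGR_MEAN = [103.939, 116.779, 123.68]


def scale_to_unit_range(pixel_rgb):
    """MobileNetV2-style: map 0..255 values to -1..1."""
    return [channel / 127.5 - 1.0 for channel in pixel_rgb]


def caffe_style(pixel_rgb):
    """VGG16/ResNet50-style: reorder RGB to BGR, then subtract the ImageNet mean."""
    pixel_bgr = pixel_rgb[::-1]
    return [channel - mean for channel, mean in zip(pixel_bgr, IMAGENET_BGR_MEAN)]


pure_red = [255.0, 0.0, 0.0]
print("scaled to [-1, 1]:", scale_to_unit_range(pure_red))
print("Caffe-style (BGR):", caffe_style(pure_red))
```

The same pixel ends up with very different values under the two schemes, which is why feeding a model the wrong preprocessing tends to degrade accuracy without raising any error.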

### 5. Data Augmentation for Transfer Learning

Data augmentation is crucial when using transfer learning with small datasets:


In [27]:
# Example of data augmentation
data_augmentation = keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.1),
    layers.RandomTranslation(0.1, 0.1),
])

# Visualize augmented images
sample_image = x_train[0:1]
sample_image_resized = tf.image.resize(sample_image, (224, 224)) / 255.0

fig, axes = plt.subplots(2, 4, figsize=(16, 8))
axes = axes.ravel()

for i in range(8):
    augmented = data_augmentation(sample_image_resized, training=True)
    axes[i].imshow(augmented[0])
    axes[i].set_title(f'Augmentation {i+1}', fontsize=10)
    axes[i].axis('off')

plt.suptitle('Data Augmentation Examples', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("Data augmentation techniques applied:")
print("• Random horizontal flips")
print("• Random rotations (±36°, i.e. 0.1 of a full turn)")
print("• Random zoom (±10%)")
print("• Random translations (±10%)")


<Figure size 1600x800 with 8 Axes>

Data augmentation techniques applied:
• Random horizontal flips
• Random rotations (±36°, i.e. 0.1 of a full turn)
• Random zoom (±10%)
• Random translations (±10%)


## Key Takeaways

### What We Learned Today

1. **Transfer Learning Fundamentals**
   - Reusing pre-trained models saves time and improves performance
   - Neural networks learn hierarchical features that transfer across tasks
   - Particularly effective when target dataset is small

2. **Two Main Approaches**
   - **Feature Extraction**: Freeze pre-trained layers, train only new classifier
   - **Fine-Tuning**: Unfreeze and continue training some pre-trained layers
   - Hybrid approach often works best: feature extraction then fine-tuning

3. **Mathematical Foundation**
   - Transfer learning leverages learned representations $h = f_\theta(x)$
   - Fine-tuning requires small learning rates to prevent catastrophic forgetting
   - Domain adaptation needed when source and target distributions differ

4. **Practical Implementation**
   - Use appropriate pre-trained models (VGG, ResNet, MobileNet, etc.)
   - Match preprocessing to the pre-trained model's requirements
   - Start conservative (freeze more), then gradually unfreeze layers
   - Monitor validation performance to prevent overfitting

5. **When to Use Transfer Learning**
   - ✅ Limited training data available
   - ✅ Similar domain to pre-trained model
   - ✅ Want faster training and better performance
   - ❌ Very different domain (may need domain adaptation)
   - ❌ Unlimited data and computation (may train from scratch)

### Skills Acquired

By completing this lesson, you can now:

✓ Explain the theory and mathematics behind transfer learning
✓ Load and use pre-trained models from TensorFlow/Keras
✓ Implement feature extraction and fine-tuning approaches
✓ Apply transfer learning to real-world image classification problems
✓ Compare transfer learning performance against baseline models
✓ Choose appropriate strategies based on dataset size and domain
✓ Implement data augmentation for improved generalization

### Impact

Transfer learning has democratized deep learning by:
- Making state-of-the-art models accessible without massive computational resources
- Enabling solutions for specialized domains with limited data
- Reducing the carbon footprint of training (reusing vs retraining)
- Accelerating research and development cycles


## Hands-On Exercise for the Reader

### Exercise: Apply Transfer Learning to a Different Dataset

Try applying what you've learned to a different problem:

#### Option 1: Flowers Classification
```python
# Load flowers dataset
dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
# This contains images of 5 types of flowers
```

#### Option 2: Cats vs Dogs
```python
# Use Kaggle's famous cats vs dogs dataset
# Binary classification problem
```

#### Steps to Complete:

1. **Load your chosen dataset**
   - Download and prepare the data
   - Split into train/validation/test sets

2. **Choose a pre-trained model**
   - Try different models (VGG16, ResNet50, MobileNet, EfficientNet)
   - Compare their performance

3. **Implement feature extraction**
   - Freeze the base model
   - Train new classifier layers
   - Evaluate performance

4. **Implement fine-tuning**
   - Unfreeze some layers
   - Use lower learning rate
   - Compare with feature extraction

5. **Experiment with hyperparameters**
   - Try different learning rates
   - Vary the number of unfrozen layers
   - Test different optimizer configurations

6. **Compare results**
   - Create visualizations of your results
   - Analyze which approach works best
   - Document your findings
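
For step 6, a small helper like the one below may be enough to get started. It is a sketch under simple assumptions: `summarize_results` is a hypothetical function, and the sample input reuses the test accuracies reported earlier in this lesson. Replace them with your own numbers.

```python
def summarize_results(results, baseline):
    """Return (name, accuracy, absolute gain over baseline) rows, best first."""
    baseline_accuracy = results[baseline]
    rows = [
        (name, accuracy, accuracy - baseline_accuracy)
        for name, accuracy in results.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Test accuracies from this lesson, used here only as sample input.
results = {
    "From Scratch": 0.6120,
    "Transfer Learning": 0.8090,
    "After Fine-tuning": 0.8540,
}

for name, accuracy, gain in summarize_results(results, baseline="From Scratch"):
    print(f"{name:<20} accuracy={accuracy:.4f}  gain={gain:+.4f}")
```

Reporting the gain over a fixed baseline makes it easier to compare runs across different pre-trained models and hyperparameter settings.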

### Challenge: Domain Adaptation

For advanced learners, try transfer learning on a very different domain:
- Medical images (chest X-rays, skin lesions)
- Satellite imagery
- Artistic style transfer
- Industrial defect detection

**Question to explore**: How well do features learned on natural images (ImageNet) transfer to these specialized domains?


## Further Resources

### Academic Papers

1. **"How transferable are features in deep neural networks?"** (Yosinski et al., 2014)
   - Seminal paper on transfer learning in deep networks
   - https://arxiv.org/abs/1411.1792

2. **"A Survey on Transfer Learning"** (Pan & Yang, 2010)
   - Comprehensive survey of transfer learning methods
   - https://ieeexplore.ieee.org/document/5288526

3. **"Adapting Visual Category Models to New Domains"** (Saenko et al., 2010)
   - Addresses the domain shift problem
   - Key work on adapting models across domains

### Online Resources

1. **TensorFlow Transfer Learning Guide**
   - https://www.tensorflow.org/tutorials/images/transfer_learning
   - Official tutorial with code examples

2. **PyTorch Transfer Learning Tutorial**
   - https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html
   - Alternative framework implementation

3. **CS231n: Convolutional Neural Networks for Visual Recognition**
   - https://cs231n.github.io/transfer-learning/
   - Stanford's excellent deep learning course notes

4. **Fast.ai Practical Deep Learning**
   - https://course.fast.ai/
   - Emphasizes transfer learning throughout

### Model Zoos and Pre-trained Models

1. **TensorFlow Hub**
   - https://tfhub.dev/
   - Repository of pre-trained models

2. **Keras Applications**
   - https://keras.io/api/applications/
   - Built-in pre-trained models

3. **Hugging Face Model Hub**
   - https://huggingface.co/models
   - Especially for NLP models

### Books

1. **"Deep Learning"** by Goodfellow, Bengio, and Courville
   - Chapter 15 covers representation learning and transfer learning

2. **"Hands-On Transfer Learning with Python"** by Dipanjan Sarkar et al.
   - Practical guide with extensive code examples

### Related Topics to Explore

- **Domain Adaptation**: Adapting models when source and target distributions differ
- **Multi-task Learning**: Training one model on multiple related tasks simultaneously
- **Few-shot Learning**: Learning from very few examples per class
- **Meta-learning**: Learning to learn, or learning across many tasks
- **Neural Architecture Search**: Automatically finding optimal architectures
- **Self-supervised Learning**: Learning representations without labeled data

---

**Congratulations on completing Day 63!**

You've learned one of the most practical and powerful techniques in modern machine learning. Transfer learning is used in production systems worldwide, from medical diagnosis to autonomous vehicles. The skills you've gained today will be immediately applicable to real-world problems.

Tomorrow, we'll explore **Generative Adversarial Networks (GANs)**, another exciting advancement in deep learning!

**Keep learning, keep building, and see you on Day 64! 🚀**
