# Day 65: Transfer Learning with Pre-trained Models

Transfer learning is one of the most powerful techniques in modern deep learning: it lets us take knowledge from models trained on large datasets and apply it to new, often smaller datasets. It has changed how practitioners tackle machine learning problems, particularly in computer vision and natural language processing.

## What is Transfer Learning?

Transfer learning is a machine learning technique where a model developed for one task is reused as the starting point for a model on a second task. Instead of training a deep neural network from scratch, which requires massive amounts of data and computational resources, we can take a pre-trained model that has already learned useful features from a large dataset and adapt it to our specific problem.

The intuition behind transfer learning is simple yet powerful: features learned from one task can be useful for another related task. For example, a model trained to recognize everyday objects has learned to detect edges, shapes, textures, and patterns, all of which are useful for many other computer vision tasks.

## Why Transfer Learning Matters

Transfer learning addresses several key challenges in machine learning:

1. **Limited Data**: Many real-world problems don't have millions of labeled examples. Transfer learning allows us to achieve good performance even with relatively small datasets.

2. **Computational Efficiency**: Training deep neural networks from scratch can take days or weeks on powerful GPUs. Transfer learning dramatically reduces training time.

3. **Better Generalization**: Pre-trained models have learned robust features from diverse data, often leading to better generalization on new tasks.

4. **Democratization of AI**: Transfer learning makes state-of-the-art models accessible to researchers and practitioners who don't have access to massive computational resources.

## Learning Objectives

By the end of this lesson, you will be able to:

- Understand the fundamental concepts of transfer learning
- Distinguish between feature extraction and fine-tuning approaches
- Load and use pre-trained models from popular frameworks
- Adapt pre-trained models to new classification tasks
- Evaluate the performance of transfer learning models

## Theory: Understanding Transfer Learning

### The Mathematical Foundation

In traditional supervised learning, we aim to learn a function $f: X \rightarrow Y$ that maps inputs $X$ to outputs $Y$ for a single domain and task (these will play the role of the source domain $D_S$ and source task $T_S$ below). The model learns parameters $\theta$ by minimizing a loss function:

$$\theta^* = \arg\min_{\theta} \mathcal{L}(f(X; \theta), Y)$$

In transfer learning, we leverage knowledge from a source domain $D_S$ and source task $T_S$ to improve learning in a target domain $D_T$ and target task $T_T$. The key assumption is that the source and target domains/tasks are related but not identical.

### Types of Transfer Learning

**1. Feature Extraction (Frozen Layers)**

In feature extraction, we use the pre-trained model as a fixed feature extractor. The convolutional base of a pre-trained network is frozen (weights are not updated), and we only train a new classifier on top:

$$h = f_{pretrained}(x; \theta_{frozen})$$
$$\hat{y} = g(h; \theta_{new})$$

where $f_{pretrained}$ represents the frozen pre-trained layers, and $g$ represents the new trainable layers.
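
The two equations above can be illustrated with a toy NumPy sketch. Everything here is made up for illustration: the data, the random matrix `theta_frozen` standing in for pre-trained weights, and the logistic-regression head. The point is only that the extractor's weights never change while the head's weights do.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples, 10 input features, binary labels (all hypothetical).
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Stand-in for a pre-trained extractor f: a fixed linear map followed by ReLU.
theta_frozen = rng.normal(size=(10, 16)) / np.sqrt(10)
theta_frozen_before = theta_frozen.copy()

def extract_features(x):
    return np.maximum(x @ theta_frozen, 0.0)      # h = f(x; theta_frozen)

def predict(h, w, b):
    return 1.0 / (1.0 + np.exp(-(h @ w + b)))     # y_hat = g(h; theta_new)

def bce(p, y_true):
    eps = 1e-9
    return float(-np.mean(y_true * np.log(p + eps) + (1 - y_true) * np.log(1 - p + eps)))

h = extract_features(X)        # computed once, because f never changes
w, b = np.zeros(16), 0.0       # theta_new: the only trainable parameters
initial_loss = bce(predict(h, w, b), y)

learning_rate = 0.2
for _ in range(300):
    error = predict(h, w, b) - y
    w -= learning_rate * (h.T @ error) / len(y)   # gradient step on the head only
    b -= learning_rate * error.mean()

final_loss = bce(predict(h, w, b), y)
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```

Because the features `h` depend only on frozen weights, they can be computed once and cached, which is one reason feature extraction is cheap.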

**2. Fine-Tuning (Unfrozen Layers)**

In fine-tuning, we unfreeze some or all of the pre-trained layers and continue training with a small learning rate. This allows the model to adapt the learned features to our specific task:

$$\theta_{final} = \theta_{pretrained} + \Delta\theta$$

where $\Delta\theta$ represents the updates to the pre-trained weights, typically with a learning rate $\alpha_{\text{fine-tune}} \ll \alpha_{\text{initial}}$.
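
As a rough illustration of why the learning rate matters, the sketch below runs a few gradient steps on a toy quadratic loss from a hypothetical set of "pre-trained" weights. The weights, the target optimum, and the loss are all assumptions chosen for simplicity; the only point is that a smaller learning rate yields a smaller $\Delta\theta$, so the weights stay closer to their pre-trained values.

```python
import numpy as np

theta_pretrained = np.array([1.0, -2.0, 0.5])   # hypothetical pre-trained weights
target_optimum = np.array([1.2, -1.5, 0.0])     # hypothetical optimum for the new task

def grad(theta):
    # Gradient of the toy loss 0.5 * ||theta - target_optimum||^2
    return theta - target_optimum

def fine_tune(theta_start, learning_rate, steps=10):
    theta = theta_start.copy()
    for _ in range(steps):
        theta = theta - learning_rate * grad(theta)
    return theta

for lr in (1e-1, 1e-3):
    theta_final = fine_tune(theta_pretrained, lr)
    delta_theta = theta_final - theta_pretrained   # the update term in the formula above
    print(f"lr={lr:g}  ||delta_theta|| = {np.linalg.norm(delta_theta):.4f}")
```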

### Layer-wise Feature Hierarchy

Deep neural networks learn hierarchical features:

- **Early layers**: Learn generic, low-level features (edges, colors, textures)
- **Middle layers**: Learn intermediate features (shapes, patterns)
- **Late layers**: Learn high-level, task-specific features

This hierarchy is why transfer learning works: early and middle layers learn features that are broadly applicable across many tasks.

### When to Use Which Approach

The choice between feature extraction and fine-tuning depends on:

1. **Dataset Size**:
   - Small dataset: Use feature extraction (avoid overfitting)
   - Large dataset: Fine-tuning becomes viable, since more data lowers the risk of overfitting the unfrozen weights

2. **Similarity to Source Task**:
   - Very similar: Fine-tune top layers only
   - Very different: May need to fine-tune more layers or use as feature extractor

3. **Computational Resources**:
   - Limited: Feature extraction (faster)
   - Abundant: Fine-tuning (potentially better performance)

## Popular Pre-trained Models

Several architectures have become standard for transfer learning:

### VGG (Visual Geometry Group)

VGG networks are known for their simplicity and depth. VGG16 and VGG19 use small 3x3 convolutional filters stacked deeply. The architecture is:

$$\text{Input} \rightarrow \text{Conv Blocks} \rightarrow \text{Fully Connected} \rightarrow \text{Softmax}$$

where each Conv Block consists of multiple 3x3 convolutions followed by max pooling.
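
To make the "stack of 3x3 convolutions" concrete, here is a short sketch that counts the parameters in the 13 convolutional layers of VGG16. The channel widths are the standard VGG16 configuration; the helper name `conv_params` is just for this example.

```python
# (input_channels, output_channels) for the 13 conv layers of VGG16
vgg16_conv_channels = [
    (3, 64), (64, 64),                       # block 1
    (64, 128), (128, 128),                   # block 2
    (128, 256), (256, 256), (256, 256),      # block 3
    (256, 512), (512, 512), (512, 512),      # block 4
    (512, 512), (512, 512), (512, 512),      # block 5
]

def conv_params(c_in, c_out, kernel=3):
    # kernel * kernel * c_in weights per filter, plus one bias per filter
    return kernel * kernel * c_in * c_out + c_out

# Max pooling layers have no parameters, so only the convolutions count.
total_conv_params = sum(conv_params(c_in, c_out) for c_in, c_out in vgg16_conv_channels)
print(f"VGG16 convolutional base parameters: {total_conv_params:,}")
```

The result can be compared with what `vgg_base.count_params()` reports later in this lesson when VGG16 is loaded with `include_top=False`.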

### ResNet (Residual Networks)

ResNet introduced skip connections to enable training very deep networks. A residual block computes:

$$y = F(x, \{W_i\}) + x$$

where $F(x, \{W_i\})$ represents the residual mapping and $x$ is the identity shortcut connection.
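
A minimal NumPy sketch of the residual formula follows. It is a simplification: real ResNet blocks use convolutions and batch normalization, whereas this example uses two plain matrix multiplications with a ReLU in between. The function name and shapes are illustrative assumptions.

```python
import numpy as np

def residual_block(x, w1, w2):
    # F(x, {W_i}): two linear maps with a ReLU in between (simplified)
    residual = np.maximum(x @ w1, 0.0) @ w2
    return residual + x          # identity shortcut: y = F(x) + x

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))      # batch of 4 vectors with 8 features

# With zero weights the residual branch contributes nothing,
# so the block reduces to the identity mapping.
w_zero = np.zeros((8, 8))
y_identity = residual_block(x, w_zero, w_zero)

# With non-zero weights the block outputs x plus a learned correction.
w1 = rng.normal(scale=0.1, size=(8, 8))
w2 = rng.normal(scale=0.1, size=(8, 8))
y = residual_block(x, w1, w2)
print(np.allclose(y_identity, x), y.shape)
```

The zero-weight case shows why skip connections help with depth: a block can fall back to passing its input through unchanged.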

### MobileNet

MobileNet uses depthwise separable convolutions for efficiency:

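$$\frac{\text{Cost}_{\text{depthwise}} + \text{Cost}_{\text{pointwise}}}{\text{Cost}_{\text{standard}}} = \frac{1}{N} + \frac{1}{D_K^2}$$

where $N$ is the number of output channels and $D_K$ is the kernel size. For a $3 \times 3$ kernel, this works out to roughly 8 to 9 times less computation than a standard convolution, with only a small loss in accuracy.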
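
The cost ratio can be checked numerically. In the sketch below, `d_k` is the kernel size and `n` the number of output channels; `m` (input channels) and `d_f` (feature map width and height) are additional symbols assumed here for the calculation, and the layer sizes are illustrative rather than taken from a specific model.

```python
def standard_conv_cost(d_k, m, n, d_f):
    # Multiply-adds for a standard convolution
    return d_k * d_k * m * n * d_f * d_f

def depthwise_separable_cost(d_k, m, n, d_f):
    depthwise = d_k * d_k * m * d_f * d_f      # one d_k x d_k filter per input channel
    pointwise = m * n * d_f * d_f              # 1x1 convolution that mixes channels
    return depthwise + pointwise

d_k, m, n, d_f = 3, 64, 128, 56                # illustrative layer sizes
ratio = depthwise_separable_cost(d_k, m, n, d_f) / standard_conv_cost(d_k, m, n, d_f)
formula = 1 / n + 1 / d_k ** 2

print(f"measured ratio: {ratio:.4f}, formula 1/N + 1/D_K^2: {formula:.4f}")
```

Note that the ratio does not depend on `m` or `d_f`: both cancel, leaving only the output channel count and the kernel size.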

### Inception/GoogLeNet

Inception modules process the input at multiple scales simultaneously using different kernel sizes, then concatenate the results.

## Python Implementation: Setting Up

Let's start by importing the necessary libraries and setting up our environment for transfer learning experiments.

In [None]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split

# TensorFlow and Keras for deep learning
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16, ResNet50, MobileNetV2
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.utils import to_categorical

# Set random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

# Display versions
print(f"TensorFlow version: {tf.__version__}")
print(f"Keras version: {keras.__version__}")
print(f"NumPy version: {np.__version__}")

# Set plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

## Visualizing Transfer Learning Concepts

Let's create a visualization to understand how transfer learning works and the difference between feature extraction and fine-tuning.

In [None]:
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.patches import FancyBboxPatch, FancyArrowPatch

fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# Common elements
layer_height = 0.15
layer_width = 0.6
start_x = 0.2

# Function to draw a layer
def draw_layer(ax, y_pos, label, color, frozen=False):
    rect = FancyBboxPatch((start_x, y_pos), layer_width, layer_height, 
                          boxstyle="round,pad=0.01", 
                          edgecolor='black', 
                          facecolor=color, 
                          linewidth=2 if frozen else 1,
                          linestyle='--' if frozen else '-')
    ax.add_patch(rect)
    ax.text(start_x + layer_width/2, y_pos + layer_height/2, label, 
           ha='center', va='center', fontsize=9, weight='bold')
    if frozen:
        ax.text(start_x + layer_width + 0.05, y_pos + layer_height/2, '🔒', 
               ha='left', va='center', fontsize=12)

# Subplot 1: Traditional Training
ax1 = axes[0]
ax1.set_xlim(0, 1)
ax1.set_ylim(0, 1)
ax1.axis('off')
ax1.set_title('Traditional Training\n(From Scratch)', fontsize=12, weight='bold')

y_positions = [0.7, 0.52, 0.34, 0.16]
labels = ['Input Layer', 'Conv Layers\n(Random Init)', 'Dense Layers\n(Random Init)', 'Output Layer']
colors = ['lightblue', 'lightcoral', 'lightcoral', 'lightgreen']

for y, label, color in zip(y_positions, labels, colors):
    draw_layer(ax1, y, label, color, frozen=False)

ax1.text(0.5, 0.05, 'All layers trained\nRequires large dataset', 
        ha='center', fontsize=9, style='italic')

# Subplot 2: Feature Extraction
ax2 = axes[1]
ax2.set_xlim(0, 1)
ax2.set_ylim(0, 1)
ax2.axis('off')
ax2.set_title('Transfer Learning\n(Feature Extraction)', fontsize=12, weight='bold')

labels2 = ['Input Layer', 'Pre-trained\nConv Layers', 'New Dense\nLayers', 'New Output\nLayer']
colors2 = ['lightblue', 'gold', 'lightcoral', 'lightgreen']
frozen2 = [False, True, False, False]

for y, label, color, frz in zip(y_positions, labels2, colors2, frozen2):
    draw_layer(ax2, y, label, color, frozen=frz)

ax2.text(0.5, 0.05, 'Pre-trained layers frozen\nOnly train new layers', 
        ha='center', fontsize=9, style='italic')

# Subplot 3: Fine-tuning
ax3 = axes[2]
ax3.set_xlim(0, 1)
ax3.set_ylim(0, 1)
ax3.axis('off')
ax3.set_title('Transfer Learning\n(Fine-tuning)', fontsize=12, weight='bold')

labels3 = ['Input Layer', 'Pre-trained\nConv Layers', 'New Dense\nLayers', 'New Output\nLayer']
colors3 = ['lightblue', 'lightyellow', 'lightcoral', 'lightgreen']
frozen3 = [False, False, False, False]

for y, label, color, frz in zip(y_positions, labels3, colors3, frozen3):
    draw_layer(ax3, y, label, color, frozen=frz)

ax3.text(0.5, 0.05, 'All layers trainable\nLow learning rate for pre-trained', 
        ha='center', fontsize=9, style='italic')

plt.tight_layout()
plt.savefig('transfer_learning_approaches.png', dpi=150, bbox_inches='tight')
plt.show()

print("\nKey Differences:")
print("1. Traditional: All weights learned from scratch (requires lots of data)")
print("2. Feature Extraction: Use pre-trained features, train only new layers (fast, small datasets)")
print("3. Fine-tuning: Adapt pre-trained weights to new task (best performance, moderate datasets)")

## Loading Pre-trained Models

Let's explore how to load popular pre-trained models and examine their architectures. We'll use models trained on ImageNet, a dataset of 1.4 million images across 1000 categories.

In [None]:
# Load pre-trained models (without top classification layer)
# We set include_top=False to remove the final classification layer
# This allows us to add our own classifier for our specific task

print("Loading pre-trained models...\n")

# VGG16: 16-layer network (13 conv + 3 FC); include_top=False drops the 3 FC layers
vgg_base = VGG16(weights='imagenet', 
                 include_top=False, 
                 input_shape=(224, 224, 3))
print(f"VGG16 loaded:")
print(f"  - Total layers: {len(vgg_base.layers)}")
print(f"  - Total parameters: {vgg_base.count_params():,}")
print(f"  - Output shape: {vgg_base.output_shape}\n")

# ResNet50: 50-layer residual network
resnet_base = ResNet50(weights='imagenet', 
                       include_top=False, 
                       input_shape=(224, 224, 3))
print(f"ResNet50 loaded:")
print(f"  - Total layers: {len(resnet_base.layers)}")
print(f"  - Total parameters: {resnet_base.count_params():,}")
print(f"  - Output shape: {resnet_base.output_shape}\n")

# MobileNetV2: Efficient architecture for mobile devices
mobilenet_base = MobileNetV2(weights='imagenet', 
                            include_top=False, 
                            input_shape=(224, 224, 3))
print(f"MobileNetV2 loaded:")
print(f"  - Total layers: {len(mobilenet_base.layers)}")
print(f"  - Total parameters: {mobilenet_base.count_params():,}")
print(f"  - Output shape: {mobilenet_base.output_shape}\n")

print("All models successfully loaded!")
print("\nNote: These models are pre-trained on ImageNet (1000 classes)")
print("We'll adapt them to our specific classification task.")

## Preparing Data for Transfer Learning

For this example, we'll use a subset of the CIFAR-10 dataset to demonstrate transfer learning. In practice, you would use your own domain-specific dataset.

In [None]:
# Load CIFAR-10 dataset for demonstration
# In practice, you would load your own dataset

print("Loading CIFAR-10 dataset...")
(x_train_full, y_train_full), (x_test_full, y_test_full) = keras.datasets.cifar10.load_data()

# Class names for CIFAR-10
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer', 
               'dog', 'frog', 'horse', 'ship', 'truck']

# For demonstration, let's use a binary classification task:
# Classify cats (class 3) vs dogs (class 5)
# This simulates a real-world scenario where you have a specific task

# Filter for cats and dogs only
train_mask = np.isin(y_train_full, [3, 5]).flatten()
test_mask = np.isin(y_test_full, [3, 5]).flatten()

x_train = x_train_full[train_mask]
y_train = y_train_full[train_mask]
x_test = x_test_full[test_mask]
y_test = y_test_full[test_mask]

# Convert labels: cat=0, dog=1
y_train = (y_train == 5).astype(int)
y_test = (y_test == 5).astype(int)

# Use only a subset to simulate limited data scenario
n_samples = 1000
indices = np.random.choice(len(x_train), n_samples, replace=False)
x_train_small = x_train[indices]
y_train_small = y_train[indices]

print(f"\nDataset prepared:")
print(f"  - Training samples: {len(x_train_small)}")
print(f"  - Test samples: {len(x_test)}")
print(f"  - Original image shape: {x_train_small.shape[1:]}")
print(f"  - Task: Binary classification (Cat vs Dog)")
print(f"\nClass distribution in training set:")
print(f"  - Cats: {np.sum(y_train_small == 0)}")
print(f"  - Dogs: {np.sum(y_train_small == 1)}")

## Visualizing the Dataset

Let's visualize some examples from our dataset to understand what we're working with.

In [None]:
# Visualize sample images
fig, axes = plt.subplots(2, 8, figsize=(16, 4))
fig.suptitle('Sample Images from Our Dataset', fontsize=14, weight='bold')

# Get 8 cats and 8 dogs
cat_indices = np.where(y_train_small == 0)[0][:8]
dog_indices = np.where(y_train_small == 1)[0][:8]

for i, idx in enumerate(cat_indices):
    axes[0, i].imshow(x_train_small[idx])
    axes[0, i].axis('off')
    if i == 0:
        axes[0, i].set_title('Cats', fontsize=12, weight='bold')

for i, idx in enumerate(dog_indices):
    axes[1, i].imshow(x_train_small[idx])
    axes[1, i].axis('off')
    if i == 0:
        axes[1, i].set_title('Dogs', fontsize=12, weight='bold')

plt.tight_layout()
plt.show()

print("Note: CIFAR-10 images are 32x32 pixels, quite small!")
print("Pre-trained models expect 224x224, so we'll need to resize.")

## Preprocessing for Pre-trained Models

Pre-trained models expect specific input formats. We need to:
1. Resize images to the expected input size (224x224 for most models)
2. Apply model-specific preprocessing (e.g., normalization)
3. Ensure proper data types

In [None]:
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input

# Function to preprocess images for transfer learning
def preprocess_images(images, target_size=(224, 224)):
    """
    Resize and preprocess images for pre-trained models
    """
    processed_images = []
    for img in images:
        # Resize using TensorFlow
        img_resized = tf.image.resize(img, target_size)
        processed_images.append(img_resized.numpy())
    
    processed_images = np.array(processed_images)
    # Apply model-specific preprocessing
    processed_images = preprocess_input(processed_images)
    return processed_images

print("Preprocessing images...")
x_train_processed = preprocess_images(x_train_small)
x_test_processed = preprocess_images(x_test)

print(f"\nPreprocessing complete:")
print(f"  - New shape: {x_train_processed.shape}")
print(f"  - Value range: [{x_train_processed.min():.2f}, {x_train_processed.max():.2f}]")
print(f"  - Data type: {x_train_processed.dtype}")
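
For intuition about the "Value range" printed above: MobileNetV2's `preprocess_input` is documented as scaling pixel values to the range [-1, 1]. The sketch below reproduces that scaling by hand in NumPy. The function name `scale_to_minus_one_one` is made up for this example, and other architectures (VGG16 and ResNet50, for instance) use a different scheme, so this is not a general substitute for each model's own `preprocess_input`.

```python
import numpy as np

def scale_to_minus_one_one(pixels):
    # Map values in [0, 255] to floats in [-1, 1]
    return pixels.astype("float32") / 127.5 - 1.0

sample = np.array([[0, 128, 255]], dtype=np.uint8)   # darkest, mid-gray, brightest
scaled = scale_to_minus_one_one(sample)

print(scaled, scaled.dtype)
```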

## Approach 1: Feature Extraction

In this approach, we freeze the pre-trained convolutional base and only train new classification layers. This is fast and works well with small datasets.

In [None]:
# Build a model using feature extraction
print("Building feature extraction model...\n")

# Use MobileNetV2 as the base
base_model = MobileNetV2(weights='imagenet', 
                        include_top=False, 
                        input_shape=(224, 224, 3))

# Freeze the base model
base_model.trainable = False

print(f"Base model trainable: {base_model.trainable}")
print(f"Number of trainable weights: {len(base_model.trainable_weights)}")

# Build the complete model
model_fe = models.Sequential([
    base_model,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid')
], name='feature_extraction_model')

# Compile the model
model_fe.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='binary_crossentropy',
    metrics=['accuracy']
)

print("\nModel architecture:")
model_fe.summary()

# Count trainable vs non-trainable parameters
trainable_count = np.sum([keras.backend.count_params(w) for w in model_fe.trainable_weights])
non_trainable_count = np.sum([keras.backend.count_params(w) for w in model_fe.non_trainable_weights])

print(f"\nParameter breakdown:")
print(f"  - Trainable parameters: {trainable_count:,}")
print(f"  - Non-trainable parameters: {non_trainable_count:,}")
print(f"  - Total parameters: {trainable_count + non_trainable_count:,}")
print(f"  - Percentage trainable: {100 * trainable_count / (trainable_count + non_trainable_count):.2f}%")
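
The trainable parameter count reported above can be sanity-checked by hand. The sketch assumes the MobileNetV2 base outputs 1280 channels (its default width), so that `GlobalAveragePooling2D` yields a 1280-dimensional vector; pooling and dropout layers contribute no parameters.

```python
def dense_params(n_in, n_out):
    # One weight per input-output pair, plus one bias per output unit
    return n_in * n_out + n_out

feature_dim = 1280   # assumed channel count of MobileNetV2's final feature map
head_params = dense_params(feature_dim, 128) + dense_params(128, 1)

print(f"Expected trainable parameters in the new head: {head_params:,}")
```

If the number printed by the notebook differs, the most likely cause is a different base model or a changed head architecture.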

In [None]:
# Train the feature extraction model
print("Training feature extraction model...\n")

history_fe = model_fe.fit(
    x_train_processed, y_train_small,
    batch_size=32,
    epochs=10,
    validation_split=0.2,
    verbose=1
)

print("\nTraining complete!")

## Approach 2: Fine-Tuning

In fine-tuning, we unfreeze some or all layers of the pre-trained model and continue training with a lower learning rate. This can achieve better performance but requires more careful tuning.

In [None]:
# Build a model for fine-tuning
print("Building fine-tuning model...\n")

# Create a new base model
base_model_ft = MobileNetV2(weights='imagenet', 
                           include_top=False, 
                           input_shape=(224, 224, 3))

# First, train with frozen base (same as feature extraction)
base_model_ft.trainable = False

# Build model
model_ft = models.Sequential([
    base_model_ft,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid')
], name='fine_tuning_model')

# Compile and train briefly with frozen base
model_ft.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='binary_crossentropy',
    metrics=['accuracy']
)

print("Phase 1: Training with frozen base...")
history_ft_phase1 = model_ft.fit(
    x_train_processed, y_train_small,
    batch_size=32,
    epochs=5,
    validation_split=0.2,
    verbose=0
)

print(f"Phase 1 complete. Final accuracy: {history_ft_phase1.history['accuracy'][-1]:.4f}")

# Now unfreeze the base model for fine-tuning
print("\nPhase 2: Unfreezing base model for fine-tuning...")
base_model_ft.trainable = True

# Let's unfreeze only the last 20 layers
# (fine-tuning only top layers is often more stable)
for layer in base_model_ft.layers[:-20]:
    layer.trainable = False

print(f"Layers in base model: {len(base_model_ft.layers)}")
print(f"Trainable layers: {sum([1 for layer in base_model_ft.layers if layer.trainable])}")

# Recompile with a lower learning rate
model_ft.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.0001),  # Lower learning rate!
    loss='binary_crossentropy',
    metrics=['accuracy']
)

print("\nTraining with fine-tuning...")
history_ft_phase2 = model_ft.fit(
    x_train_processed, y_train_small,
    batch_size=32,
    epochs=10,
    validation_split=0.2,
    verbose=1
)

print("\nFine-tuning complete!")
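
The slicing used to freeze everything except the last 20 layers can be tried out in isolation. The sketch below uses a hypothetical `ToyLayer` class as a stand-in for Keras layers (it only tracks a name and a `trainable` flag), and the layer count of 100 is arbitrary.

```python
class ToyLayer:
    """Stand-in for a Keras layer: tracks only a name and a trainable flag."""
    def __init__(self, name):
        self.name = name
        self.trainable = True

layers_in_base = [ToyLayer(f"layer_{i}") for i in range(100)]   # 100 is arbitrary

n_unfrozen = 20
# layers[:-n] selects every layer except the last n
for layer in layers_in_base[:-n_unfrozen]:
    layer.trainable = False

trainable_names = [layer.name for layer in layers_in_base if layer.trainable]
print(len(trainable_names), trainable_names[0], trainable_names[-1])
```

The Keras transfer learning guide additionally recommends keeping `BatchNormalization` layers in inference mode during fine-tuning; that detail is not covered by this sketch.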

## Comparing Results

Let's compare the performance of both approaches and visualize the training history.

In [None]:
# Evaluate both models on test set
print("Evaluating models on test set...\n")

# Feature extraction model
loss_fe, acc_fe = model_fe.evaluate(x_test_processed, y_test, verbose=0)
print(f"Feature Extraction Model:")
print(f"  - Test Loss: {loss_fe:.4f}")
print(f"  - Test Accuracy: {acc_fe:.4f}")

# Fine-tuning model
loss_ft, acc_ft = model_ft.evaluate(x_test_processed, y_test, verbose=0)
print(f"\nFine-Tuning Model:")
print(f"  - Test Loss: {loss_ft:.4f}")
print(f"  - Test Accuracy: {acc_ft:.4f}")

# Compare improvement
improvement = (acc_ft - acc_fe) * 100  # difference in accuracy, in percentage points
print(f"\nImprovement from fine-tuning: {improvement:+.2f} percentage points")

In [None]:
# Visualize training history
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Plot accuracy
axes[0].plot(history_fe.history['accuracy'], label='FE Train', linewidth=2)
axes[0].plot(history_fe.history['val_accuracy'], label='FE Val', linewidth=2, linestyle='--')

# For fine-tuning, combine both phases
ft_train_acc = history_ft_phase1.history['accuracy'] + history_ft_phase2.history['accuracy']
ft_val_acc = history_ft_phase1.history['val_accuracy'] + history_ft_phase2.history['val_accuracy']
axes[0].plot(range(len(ft_train_acc)), ft_train_acc, label='FT Train', linewidth=2)
axes[0].plot(range(len(ft_val_acc)), ft_val_acc, label='FT Val', linewidth=2, linestyle='--')
axes[0].axvline(x=5, color='red', linestyle=':', label='Unfreeze', alpha=0.7)

axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Accuracy', fontsize=12)
axes[0].set_title('Model Accuracy Comparison', fontsize=14, weight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot loss
axes[1].plot(history_fe.history['loss'], label='FE Train', linewidth=2)
axes[1].plot(history_fe.history['val_loss'], label='FE Val', linewidth=2, linestyle='--')

ft_train_loss = history_ft_phase1.history['loss'] + history_ft_phase2.history['loss']
ft_val_loss = history_ft_phase1.history['val_loss'] + history_ft_phase2.history['val_loss']
axes[1].plot(range(len(ft_train_loss)), ft_train_loss, label='FT Train', linewidth=2)
axes[1].plot(range(len(ft_val_loss)), ft_val_loss, label='FT Val', linewidth=2, linestyle='--')
axes[1].axvline(x=5, color='red', linestyle=':', label='Unfreeze', alpha=0.7)

axes[1].set_xlabel('Epoch', fontsize=12)
axes[1].set_ylabel('Loss', fontsize=12)
axes[1].set_title('Model Loss Comparison', fontsize=14, weight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('training_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

print("\nKey Observations (typical behavior; check against your own curves):")
print("1. Feature extraction usually converges quickly (few parameters to train)")
print("2. Fine-tuning often shows gradual improvement after unfreezing")
print("3. The vertical red line shows when we unfroze layers for fine-tuning")

## Detailed Performance Analysis

Let's create a more detailed analysis of model performance including confusion matrices and classification reports.

In [None]:
# Make predictions
y_pred_fe = (model_fe.predict(x_test_processed, verbose=0) > 0.5).astype(int)
y_pred_ft = (model_ft.predict(x_test_processed, verbose=0) > 0.5).astype(int)

# Create confusion matrices
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Feature extraction confusion matrix
cm_fe = confusion_matrix(y_test, y_pred_fe)
sns.heatmap(cm_fe, annot=True, fmt='d', cmap='Blues', ax=axes[0],
           xticklabels=['Cat', 'Dog'], yticklabels=['Cat', 'Dog'])
axes[0].set_title('Feature Extraction\nConfusion Matrix', fontsize=12, weight='bold')
axes[0].set_ylabel('True Label', fontsize=11)
axes[0].set_xlabel('Predicted Label', fontsize=11)

# Fine-tuning confusion matrix
cm_ft = confusion_matrix(y_test, y_pred_ft)
sns.heatmap(cm_ft, annot=True, fmt='d', cmap='Greens', ax=axes[1],
           xticklabels=['Cat', 'Dog'], yticklabels=['Cat', 'Dog'])
axes[1].set_title('Fine-Tuning\nConfusion Matrix', fontsize=12, weight='bold')
axes[1].set_ylabel('True Label', fontsize=11)
axes[1].set_xlabel('Predicted Label', fontsize=11)

plt.tight_layout()
plt.savefig('confusion_matrices.png', dpi=150, bbox_inches='tight')
plt.show()

# Print classification reports
print("\n" + "="*60)
print("Feature Extraction Model - Classification Report")
print("="*60)
print(classification_report(y_test, y_pred_fe, target_names=['Cat', 'Dog']))

print("\n" + "="*60)
print("Fine-Tuning Model - Classification Report")
print("="*60)
print(classification_report(y_test, y_pred_ft, target_names=['Cat', 'Dog']))
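
To read the classification report, it helps to see how its numbers follow from the confusion matrix. The sketch below computes precision, recall, F1, and accuracy for the positive class (Dog) from four counts. The counts are hypothetical, not results from the models above.

```python
def binary_metrics(tn, fp, fn, tp):
    precision = tp / (tp + fp)       # of the predicted dogs, how many are dogs
    recall = tp / (tp + fn)          # of the actual dogs, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy

# Hypothetical counts, laid out like sklearn's confusion_matrix for labels [0, 1]:
# [[tn, fp],
#  [fn, tp]]
tn, fp, fn, tp = 80, 20, 10, 90
precision, recall, f1, accuracy = binary_metrics(tn, fp, fn, tp)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```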

## Visualizing Predictions

Let's visualize some predictions from our fine-tuned model to see where it succeeds and where it fails.

In [None]:
# Get predictions with probabilities
y_pred_proba = model_ft.predict(x_test_processed, verbose=0).flatten()

# Find some interesting cases
correct_mask = (y_pred_ft.flatten() == y_test.flatten())
incorrect_indices = np.where(~correct_mask)[0]
correct_confident_indices = np.where(correct_mask & (np.abs(y_pred_proba - 0.5) > 0.4))[0]

# Visualize
fig, axes = plt.subplots(3, 8, figsize=(16, 6))
fig.suptitle('Model Predictions Analysis', fontsize=14, weight='bold')

# Row 1: Correct and confident predictions
for i in range(8):
    if i < len(correct_confident_indices):
        idx = correct_confident_indices[i]
        # x_test and x_test_processed share the same ordering, so idx
        # also selects the original (unprocessed) image
        axes[0, i].imshow(x_test[idx])
        true_label = 'Dog' if y_test[idx] == 1 else 'Cat'
        conf = y_pred_proba[idx] if y_test[idx] == 1 else 1 - y_pred_proba[idx]
        axes[0, i].set_title(f'{true_label}\n{conf:.2%}', fontsize=9, color='green', weight='bold')
    axes[0, i].axis('off')
if len(correct_confident_indices) > 0:
    axes[0, 0].text(-0.3, 0.5, 'Correct\n(Confident)', transform=axes[0, 0].transAxes,
                   fontsize=11, weight='bold', va='center', rotation=90)

# Row 2: Correct but uncertain predictions
uncertain_correct = np.where(correct_mask & (np.abs(y_pred_proba - 0.5) < 0.2))[0]
for i in range(8):
    if i < len(uncertain_correct):
        idx = uncertain_correct[i]
        axes[1, i].imshow(x_test[idx])
        true_label = 'Dog' if y_test[idx] == 1 else 'Cat'
        conf = y_pred_proba[idx] if y_test[idx] == 1 else 1 - y_pred_proba[idx]
        axes[1, i].set_title(f'{true_label}\n{conf:.2%}', fontsize=9, color='orange', weight='bold')
    axes[1, i].axis('off')
if len(uncertain_correct) > 0:
    axes[1, 0].text(-0.3, 0.5, 'Correct\n(Uncertain)', transform=axes[1, 0].transAxes,
                   fontsize=11, weight='bold', va='center', rotation=90)

# Row 3: Incorrect predictions
for i in range(min(8, len(incorrect_indices))):
    idx = incorrect_indices[i]
    axes[2, i].imshow(x_test[idx])
    true_label = 'Dog' if y_test[idx] == 1 else 'Cat'
    pred_label = 'Dog' if y_pred_ft[idx] == 1 else 'Cat'
    conf = y_pred_proba[idx] if y_pred_ft[idx] == 1 else 1 - y_pred_proba[idx]
    axes[2, i].set_title(f'True: {true_label}\nPred: {pred_label} ({conf:.2%})', 
                        fontsize=8, color='red', weight='bold')
    axes[2, i].axis('off')
for i in range(len(incorrect_indices), 8):
    axes[2, i].axis('off')
if len(incorrect_indices) > 0:
    axes[2, 0].text(-0.3, 0.5, 'Incorrect', transform=axes[2, 0].transAxes,
                   fontsize=11, weight='bold', va='center', rotation=90)

plt.tight_layout()
plt.savefig('prediction_analysis.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"\nPrediction Summary:")
print(f"  - Total test samples: {len(y_test)}")
print(f"  - Correct predictions: {np.sum(correct_mask)} ({100*np.sum(correct_mask)/len(y_test):.1f}%)")
print(f"  - Incorrect predictions: {len(incorrect_indices)} ({100*len(incorrect_indices)/len(y_test):.1f}%)")
print(f"  - High confidence (>90%): {np.sum(np.abs(y_pred_proba - 0.5) > 0.4)} samples")
print(f"  - Low confidence (<60%): {np.sum(np.abs(y_pred_proba - 0.5) < 0.1)} samples")
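
The confidence thresholds in the summary above rely on the distance of the sigmoid output from 0.5. The sketch below spells out that mapping; the function names `confidence` and `bucket` are illustrative, and the cut-offs mirror the ones used in the summary (0.4 and 0.1).

```python
def confidence(p_dog):
    # Probability assigned to whichever class the model predicts
    return max(p_dog, 1.0 - p_dog)

def bucket(p_dog):
    margin = abs(p_dog - 0.5)        # confidence equals 0.5 + margin
    if margin > 0.4:
        return "high"                # confidence above 90%
    if margin < 0.1:
        return "low"                 # confidence below 60%
    return "medium"

for p in (0.95, 0.03, 0.55, 0.75):
    print(f"p_dog={p:.2f}  confidence={confidence(p):.2f}  bucket={bucket(p)}")
```

A very low `p_dog` is therefore a confident prediction too: it is a confident prediction of Cat.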

## Best Practices for Transfer Learning

Based on our experiments and theoretical understanding, here are key best practices:

### 1. **Choosing a Pre-trained Model**

- **Task similarity**: Choose models trained on similar domains (e.g., ImageNet for general vision tasks)
- **Model size**: Balance between performance and computational constraints
- **Architecture**: Consider modern architectures (ResNet, EfficientNet, Vision Transformers)

### 2. **Data Preprocessing**

- Use the **same preprocessing** as the original model training
- Pay attention to normalization schemes (some models use different ranges)
- Consider **data augmentation** to improve generalization

### 3. **Training Strategy**

**For small datasets (< 10,000 samples)**:
$$\text{Strategy: Feature Extraction}$$
- Freeze all pre-trained layers
- Train only new classifier layers
- Use standard learning rate (0.001)

**For medium datasets (10,000 - 100,000 samples)**:
$$\text{Strategy: Partial Fine-tuning}$$
- Freeze early layers
- Fine-tune top layers
- Use lower learning rate (0.0001)

**For large datasets (> 100,000 samples)**:
$$\text{Strategy: Full Fine-tuning}$$
- Fine-tune all layers
- Use very low learning rate (0.00001)
- May even consider training from scratch
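
The rules of thumb above can be written as a small lookup helper. The function name `suggest_strategy` is hypothetical, and the thresholds and learning rates are simply the ones listed in this section; treat them as starting points rather than fixed rules.

```python
def suggest_strategy(n_samples):
    """Return (strategy, starting learning rate) using this section's rules of thumb."""
    if n_samples < 10_000:
        return "feature extraction", 1e-3
    if n_samples <= 100_000:
        return "partial fine-tuning", 1e-4
    return "full fine-tuning", 1e-5

for n in (1_000, 50_000, 500_000):
    strategy, lr = suggest_strategy(n)
    print(f"{n:>7,} samples -> {strategy} (learning rate {lr:g})")
```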

### 4. **Learning Rate Selection**

Critical guideline for fine-tuning:

$$\alpha_{\text{fine-tune}} = \frac{\alpha_{\text{scratch}}}{10} \text{ to } \frac{\alpha_{\text{scratch}}}{100}$$

Use lower learning rates to avoid destroying pre-trained features.

### 5. **Regularization**

- Use **dropout** in new layers (0.3-0.5)
- Apply **L2 regularization** if overfitting occurs
- Consider **early stopping** based on validation loss
- Use **data augmentation** when possible
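
Early stopping is easy to reason about with a plain-Python sketch. The function below is an illustration of the idea (stop once validation loss has not improved for `patience` consecutive epochs), not the Keras implementation, and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 0-indexed."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch      # no improvement for `patience` epochs
    return len(val_losses) - 1, best_epoch

val_losses = [0.70, 0.55, 0.48, 0.47, 0.49, 0.50, 0.52, 0.51]   # made-up values
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"stop at epoch {stop_epoch}, best epoch was {best_epoch}")
```

In practice the weights from the best epoch are the ones worth keeping, which is why the helper reports both indices.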

### 6. **Monitoring Training**

Watch for these signs:
- **Underfitting**: Both train and val accuracy are low → Unfreeze more layers or increase capacity
- **Overfitting**: Train accuracy high but val accuracy low → Add regularization, reduce capacity, or get more data
- **Good fit**: Both accuracies high and close together → Model is working well!

## Hands-On Exercise: Build Your Own Transfer Learning Model

Now it's your turn! Try building a transfer learning model with different configurations.

### Exercise Tasks:

1. **Try a different pre-trained model**: Use VGG16 or ResNet50 instead of MobileNetV2
2. **Experiment with architecture**: Add more dense layers or change dropout rates
3. **Adjust fine-tuning strategy**: Try unfreezing different numbers of layers
4. **Data augmentation**: Implement data augmentation to improve performance
5. **Multi-class classification**: Extend to classify all 10 CIFAR-10 classes instead of just cats vs dogs

Here's a template to get you started:

In [None]:
# Exercise Template: Build your own transfer learning model

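# TODO: Choose a different base model
# Options: VGG16, ResNet50, InceptionV3, EfficientNetB0
# (InceptionV3 and EfficientNetB0 must first be imported from tensorflow.keras.applications)
# Note: x_train_processed was prepared with MobileNetV2's preprocess_input.
# Each architecture has its own preprocess_input, so re-run preprocessing
# with the matching function before training a different base model.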
base_model_exercise = VGG16(  # Try changing this!
    weights='imagenet',
    include_top=False,
    input_shape=(224, 224, 3)
)

# TODO: Decide which layers to freeze
# Experiment with:
# - Freezing all layers: base_model_exercise.trainable = False
# - Freezing only bottom layers: for layer in base_model_exercise.layers[:X]: layer.trainable = False
# - No freezing: base_model_exercise.trainable = True

base_model_exercise.trainable = False  # Start here, then experiment!

# TODO: Design your classifier architecture
model_exercise = models.Sequential([
    base_model_exercise,
    layers.GlobalAveragePooling2D(),
    
    # Add your layers here!
    # Ideas:
    # - layers.Dense(256, activation='relu')
    # - layers.BatchNormalization()
    # - layers.Dropout(0.5)
    # - layers.Dense(128, activation='relu')
    
    layers.Dense(1, activation='sigmoid')  # For binary classification
], name='my_transfer_learning_model')

# TODO: Choose optimizer and learning rate
model_exercise.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),  # Experiment with this!
    loss='binary_crossentropy',
    metrics=['accuracy']
)

print("\nYour model:")
model_exercise.summary()

# TODO: Train your model
# Experiment with:
# - Different batch sizes
# - Different number of epochs
# - Data augmentation (e.g. Keras preprocessing layers such as layers.RandomFlip)

print("\nReady to train! Uncomment the fit() call at the end of this cell to start training.")
# history_exercise = model_exercise.fit(
#     x_train_processed, y_train_small,
#     batch_size=32,
#     epochs=10,
#     validation_split=0.2,
#     verbose=1
# )
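
For the "freezing only bottom layers" option in the template, the helper below is one possible sketch. It runs without TensorFlow by using stand-in layer objects; `FakeLayer`, `unfreeze_top_layers`, and the `keep_batchnorm_frozen` flag are hypothetical names introduced for illustration. Keeping batch-normalization layers frozen during fine-tuning is a common recommendation, not something this lesson's code requires.

```python
class FakeLayer:
    """Stand-in for a Keras layer so the sketch runs without TensorFlow."""

    def __init__(self, name):
        self.name = name
        self.trainable = True


class BatchNormalization(FakeLayer):
    """Named like the Keras class so the type check below behaves the same."""


def unfreeze_top_layers(layers, n_trainable, keep_batchnorm_frozen=True):
    """Freeze every layer except the last `n_trainable`.

    Returns the names of the layers left trainable.
    """
    if n_trainable < 0:
        raise ValueError("n_trainable must be non-negative")

    cutoff = max(len(layers) - n_trainable, 0)
    trainable_names = []
    for index, layer in enumerate(layers):
        is_batchnorm = type(layer).__name__ == "BatchNormalization"
        # Layers below the cutoff stay frozen; optionally keep batch norm frozen too.
        layer.trainable = index >= cutoff and not (keep_batchnorm_frozen and is_batchnorm)
        if layer.trainable:
            trainable_names.append(layer.name)
    return trainable_names


demo_layers = [
    FakeLayer("conv_1"),
    BatchNormalization("bn_1"),
    FakeLayer("conv_2"),
    BatchNormalization("bn_2"),
    FakeLayer("conv_3"),
]
print(unfreeze_top_layers(demo_layers, n_trainable=3))
```

With a real Keras model you would typically set `base_model_exercise.trainable = True` first, pass `base_model_exercise.layers` to a helper like this, and then recompile the model before calling `fit` so the change takes effect.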

## Advanced: Data Augmentation for Better Performance

Data augmentation improves model generalization by applying random, label-preserving transformations to the training images. It matters most when you are working with limited data.

In [None]:
# Create data augmentation pipeline
data_augmentation = keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.1),
    layers.RandomContrast(0.1),
], name='data_augmentation')

# Visualize augmentation effects
sample_image = x_train_processed[0:1]

fig, axes = plt.subplots(2, 4, figsize=(12, 6))
fig.suptitle('Data Augmentation Examples', fontsize=14, weight='bold')

for i in range(8):
    ax = axes[i // 4, i % 4]
    augmented = data_augmentation(sample_image, training=True)
    # Denormalize for visualization
    img_display = augmented[0].numpy()
    img_display = (img_display - img_display.min()) / (img_display.max() - img_display.min())
    ax.imshow(img_display)
    ax.axis('off')
    ax.set_title(f'Augmentation {i+1}', fontsize=10)

plt.tight_layout()
plt.savefig('data_augmentation.png', dpi=150, bbox_inches='tight')
plt.show()

print("\nData augmentation techniques applied:")
print("  - Random horizontal flips")
print("  - Random rotations (±10% of a full turn, i.e. ±36°)")
print("  - Random zoom (±10%)")
print("  - Random contrast adjustments (±10%)")
print("\nThese help the model learn more robust features!")
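
To see what two of these transformations do to the pixel values, here is a simplified NumPy sketch. It is not Keras's implementation: the function names `random_flip_horizontal` and `random_contrast` and the toy 4x4 image are assumptions made for illustration. The contrast step follows the per-channel formula `(x - mean) * factor + mean` described in the Keras `RandomContrast` documentation.

```python
import numpy as np


def random_flip_horizontal(image, rng):
    """Mirror an (H, W, C) image left-to-right with probability 0.5."""
    return image[:, ::-1, :] if rng.random() < 0.5 else image


def random_contrast(image, factor, rng):
    """Scale each channel's deviation from its mean.

    The scale is drawn uniformly from [1 - factor, 1 + factor].
    """
    scale = rng.uniform(1.0 - factor, 1.0 + factor)
    channel_mean = image.mean(axis=(0, 1), keepdims=True)
    return (image - channel_mean) * scale + channel_mean


rng = np.random.default_rng(seed=0)
image = rng.random((4, 4, 3)).astype(np.float32)  # toy image with values in [0, 1)

augmented = random_contrast(random_flip_horizontal(image, rng), factor=0.1, rng=rng)
print("shape preserved:", augmented.shape == image.shape)
```

Both operations leave the per-channel mean unchanged, which is why the augmented images still look like the same photo with slightly different framing and contrast.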

## Key Takeaways

Congratulations on completing this comprehensive lesson on transfer learning! Here are the key points to remember:

### Core Concepts

1. **Transfer Learning Fundamentals**:
   - Reuses knowledge from pre-trained models trained on large datasets
   - Particularly powerful when you have limited training data
   - Based on the idea that features learned for one task can be useful for related tasks

2. **Two Main Approaches**:
   - **Feature Extraction**: Freeze pre-trained layers, train only new classifier (fast, safe for small data)
   - **Fine-Tuning**: Continue training pre-trained layers with low learning rate (better performance, needs more data)

3. **When to Use What**:
   - Small dataset + similar task → Feature extraction
   - Small dataset + different task → Feature extraction with more new layers
   - Large dataset + similar task → Fine-tuning
   - Large dataset + different task → Fine-tuning or training from scratch
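
The "When to Use What" rules can be written as a small lookup function. This is a sketch only: the name `choose_strategy` and the 10,000-sample cutoff for a "small" dataset are assumptions made for the example, since the lesson does not define where small ends and large begins.

```python
def choose_strategy(num_samples, similar_task, small_threshold=10_000):
    """Map dataset size and task similarity to a transfer learning strategy.

    `small_threshold` is an illustrative assumption, not a rule from the lesson.
    """
    small = num_samples < small_threshold
    if small and similar_task:
        return "feature extraction"
    if small:
        return "feature extraction with more new layers"
    if similar_task:
        return "fine-tuning"
    return "fine-tuning or training from scratch"


print(choose_strategy(2_000, similar_task=True))
print(choose_strategy(500_000, similar_task=False))
```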

### Practical Skills Gained

- Loading and using pre-trained models from Keras/TensorFlow
- Freezing and unfreezing layers for different training strategies
- Properly preprocessing data for pre-trained models
- Implementing both feature extraction and fine-tuning approaches
- Evaluating and comparing model performance
- Understanding the trade-offs between different approaches

### Mathematical Insights

- Pre-trained models learn hierarchical features from simple to complex
- Fine-tuning requires much lower learning rates than training from scratch
- The choice of how many layers to freeze depends on data size and task similarity

### Best Practices

- Always use the same preprocessing as the original pre-trained model
- Start with frozen base and new classifier, then fine-tune if needed
- Use learning rates 10-100x lower for fine-tuning than for training from scratch
- Monitor both training and validation metrics to detect overfitting
- Consider data augmentation to improve generalization
- Choose models based on task requirements and computational constraints
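
As a worked example of the learning-rate guideline, the sketch below computes the fine-tuning range from a baseline rate. The helper name `fine_tuning_lr_range` is hypothetical, and using the 0.001 from the exercise template as the baseline is an assumption about which rate the 10-100x rule is measured against.

```python
def fine_tuning_lr_range(baseline_lr, min_factor=10, max_factor=100):
    """Return (lowest, highest) fine-tuning learning rates for a baseline rate."""
    return baseline_lr / max_factor, baseline_lr / min_factor


low, high = fine_tuning_lr_range(0.001)
print(f"Fine-tuning learning rate between {low:.0e} and {high:.0e}")
```

Starting near the top of the range and lowering the rate if validation loss becomes unstable is one reasonable way to pick a value.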

## Further Resources

To deepen your understanding of transfer learning and continue your journey in deep learning:

### Essential Reading

1. **Research Papers**:
   - ["A Survey on Transfer Learning"](https://www.cse.ust.hk/~qyang/Docs/2009/tkde_transfer_learning.pdf) by Pan & Yang (2010) - Comprehensive overview
   - ["Deep Residual Learning for Image Recognition"](https://arxiv.org/abs/1512.03385) - The ResNet paper
   - ["Very Deep Convolutional Networks for Large-Scale Image Recognition"](https://arxiv.org/abs/1409.1556) - The VGG paper
   - ["Rethinking the Inception Architecture"](https://arxiv.org/abs/1512.00567) - Inception v3 and transfer learning insights

2. **Official Documentation**:
   - [TensorFlow Transfer Learning Guide](https://www.tensorflow.org/tutorials/images/transfer_learning)
   - [Keras Applications Documentation](https://keras.io/api/applications/)
   - [PyTorch Transfer Learning Tutorial](https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html)

3. **Books**:
   - "Deep Learning with Python" by François Chollet (creator of Keras)
   - "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" by Aurélien Géron
   - "Deep Learning" by Goodfellow, Bengio, and Courville (free online)

### Online Courses and Tutorials

4. **Video Tutorials**:
   - [Stanford CS231n: Convolutional Neural Networks for Visual Recognition](http://cs231n.stanford.edu/)
   - [Fast.ai Practical Deep Learning for Coders](https://course.fast.ai/)
   - [DeepLearning.AI TensorFlow Developer Specialization](https://www.coursera.org/professional-certificates/tensorflow-in-practice)

5. **Interactive Resources**:
   - [TensorFlow Hub](https://tfhub.dev/) - Repository of pre-trained models
   - [Hugging Face Model Hub](https://huggingface.co/models) - Pre-trained models for NLP and vision
   - [Papers with Code](https://paperswithcode.com/methods/category/transfer-learning) - Latest research with implementations

### Practical Applications

6. **Datasets for Practice**:
   - [Kaggle Datasets](https://www.kaggle.com/datasets) - Real-world datasets
   - [TensorFlow Datasets](https://www.tensorflow.org/datasets) - Ready-to-use datasets
   - [ImageNet](https://www.image-net.org/) - The dataset most pre-trained models are trained on

7. **Model Zoos and Pre-trained Models**:
   - [TensorFlow Model Garden](https://github.com/tensorflow/models)
   - [ONNX Model Zoo](https://github.com/onnx/models)
   - [Timm (PyTorch Image Models)](https://github.com/rwightman/pytorch-image-models)

### Community and Discussion

8. **Forums and Communities**:
   - [r/MachineLearning](https://www.reddit.com/r/MachineLearning/) - Reddit community
   - [Stack Overflow](https://stackoverflow.com/questions/tagged/transfer-learning) - Q&A
   - [Cross Validated](https://stats.stackexchange.com/) - Statistical ML discussions

### Next Steps in Your Learning Journey

- Experiment with different pre-trained models on your own datasets
- Explore transfer learning in other domains (NLP, audio, time series)
- Learn about advanced techniques like multi-task learning and meta-learning
- Study domain adaptation for when source and target domains differ significantly
- Investigate how to create your own pre-trained models for specific domains

## Conclusion

Transfer learning has revolutionized how we approach machine learning problems, making state-of-the-art performance accessible even with limited data and computational resources. By leveraging pre-trained models, you can build powerful applications without training from scratch.

Remember: the key to successful transfer learning is understanding your data, choosing the right strategy, and carefully monitoring your model's performance. Don't be afraid to experiment with different approaches!

**Happy learning, and good luck with your transfer learning projects!** 🚀