# Module 10: Transfer Learning Concepts

**Difficulty**: ⭐⭐ (Intermediate)

**Estimated Time**: 45-60 minutes

**Prerequisites**: 
- [Module 05: Feed-Forward Neural Networks with Keras](05_feedforward_neural_networks_keras.ipynb)
- [Module 07: Regularization Techniques](07_regularization_techniques.ipynb)
- [Module 09: Hyperparameter Tuning](09_hyperparameter_tuning.ipynb)

## Learning Objectives

By the end of this notebook, you will be able to:
1. Understand what transfer learning is and why it works
2. Differentiate between feature extraction and fine-tuning approaches
3. Use pre-trained models from Keras Applications
4. Freeze and unfreeze layers for controlled training
5. Apply transfer learning to small datasets effectively
6. Decide when transfer learning is appropriate for your problem

## 1. What is Transfer Learning?

**Transfer Learning** is the practice of taking knowledge gained from solving one problem and applying it to a different but related problem.

### The Core Idea

Instead of training a neural network from scratch, we:
1. Take a model pre-trained on a large dataset (e.g., ImageNet with 1.2M images)
2. Adapt it to our specific task (which might have only 1,000 images)
3. Leverage the learned features (edges, textures, patterns) from the pre-trained model

### Why Transfer Learning Works

Neural networks learn hierarchical features:
- **Early layers**: Generic features (edges, colors, textures)
- **Middle layers**: Pattern combinations (shapes, object parts)
- **Late layers**: Task-specific features (entire objects, classes)

The generic features learned on ImageNet are useful for many vision tasks!
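
One way to make this hierarchy concrete is to list the convolutional blocks of a network such as VGG16. The following is a minimal sketch, assuming `weights=None` (random initialization, chosen here only to avoid a weight download); it shows the layer structure, not the learned features themselves.

```python
from tensorflow.keras.applications import VGG16

# Assumption: weights=None skips the ImageNet download; the layer
# structure is identical to the pre-trained model.
base = VGG16(include_top=False, weights=None, input_shape=(32, 32, 3))

# Group layers by the "blockN" prefix in their names
blocks = {}
for layer in base.layers:
    if layer.name.startswith("block"):
        block_name = layer.name.split("_")[0]
        blocks.setdefault(block_name, []).append(layer)

# Collect the number of conv layers and filters per block
conv_counts = []
filter_counts = []
for block_name, block_layers in blocks.items():
    conv_layers = [l for l in block_layers if "conv" in l.name]
    conv_counts.append(len(conv_layers))
    filter_counts.append(conv_layers[0].filters)
    print(f"{block_name}: {len(conv_layers)} conv layers, "
          f"{conv_layers[0].filters} filters each")
```

Earlier blocks have fewer filters and operate at higher spatial resolution; later blocks have more filters and see a larger portion of the image.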

### Benefits of Transfer Learning

1. **Requires less data**: Works well with small datasets
2. **Trains faster**: Pre-trained features reduce training time
3. **Better performance**: Often achieves higher accuracy than training from scratch
4. **Reduces computational cost**: No need for massive compute resources

## 2. Setup and Imports

In [None]:
# Core libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Deep learning libraries
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.applications import (
    VGG16, ResNet50, MobileNetV2,
    vgg16, resnet50, mobilenet_v2
)

# Dataset
from tensorflow.keras.datasets import cifar10

# Utilities
from sklearn.model_selection import train_test_split

# For reproducibility
np.random.seed(42)
tf.random.set_seed(42)

# Plotting configuration
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')

print(f"TensorFlow version: {tf.__version__}")
print(f"Keras version: {keras.__version__}")

## 3. Load and Prepare Data

We'll use CIFAR-10, but simulate a small dataset scenario by using only a subset.

In [None]:
# Load CIFAR-10 dataset
(X_train_full, y_train_full), (X_test, y_test) = cifar10.load_data()

# Class names for CIFAR-10
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

# Simulate small dataset: use only 1000 training samples
# This demonstrates transfer learning's power on limited data
n_samples = 1000
indices = np.random.choice(len(X_train_full), n_samples, replace=False)
X_train_small = X_train_full[indices]
y_train_small = y_train_full[indices]

# Create validation split
X_train, X_val, y_train, y_val = train_test_split(
    X_train_small, y_train_small, 
    test_size=0.2, 
    random_state=42,
    stratify=y_train_small
)

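# Normalize to [0, 1] range (simple scaling; note that each Keras Applications
# model also provides its own preprocess_input function)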
X_train = X_train.astype('float32') / 255.0
X_val = X_val.astype('float32') / 255.0
X_test = X_test.astype('float32') / 255.0

# Flatten labels
y_train = y_train.flatten()
y_val = y_val.flatten()
y_test = y_test.flatten()

print(f"Training set: {X_train.shape}")
print(f"Validation set: {X_val.shape}")
print(f"Test set: {X_test.shape}")
print(f"Image shape: {X_train[0].shape}")
print(f"Number of classes: {len(class_names)}")

In [None]:
# Visualize some training examples
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
axes = axes.ravel()

for i in range(10):
    axes[i].imshow(X_train[i])
    axes[i].set_title(class_names[y_train[i]], fontsize=10)
    axes[i].axis('off')

plt.suptitle('Sample Training Images', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

## 4. Baseline: Training from Scratch

First, let's train a model from scratch to establish a baseline performance.

In [None]:
def create_baseline_model():
    """
    Create a simple CNN to be trained from scratch.
    """
    model = keras.Sequential([
        # keras.Input works across Keras versions (InputLayer's input_shape is deprecated in Keras 3)
        keras.Input(shape=(32, 32, 3)),
        
        # First conv block
        layers.Conv2D(32, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),
        
        # Second conv block
        layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),
        
        # Third conv block
        layers.Conv2D(128, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),
        
        # Dense layers
        layers.Flatten(),
        layers.Dense(128, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(10, activation='softmax')
    ])
    
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    
    return model

# Create and display baseline model
baseline_model = create_baseline_model()
print("Baseline Model Architecture:")
baseline_model.summary()

In [None]:
# Train baseline model
print("Training baseline model from scratch...")
baseline_history = baseline_model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=20,
    batch_size=32,
    verbose=0
)

# Evaluate on test set
baseline_test_loss, baseline_test_acc = baseline_model.evaluate(X_test, y_test, verbose=0)
print(f"\nBaseline Test Accuracy: {baseline_test_acc:.4f}")

## 5. Understanding Pre-trained Models

Keras provides several pre-trained models through `keras.applications`:

### Popular Pre-trained Models:

1. **VGG16/VGG19**: Simple architecture, large model size
2. **ResNet50/ResNet101**: Skip connections, deeper networks
3. **MobileNetV2**: Lightweight, efficient for mobile/edge devices
4. **InceptionV3**: Multi-scale feature extraction
5. **EfficientNet**: Family of models scaled for a strong accuracy-to-compute trade-off

All are trained on **ImageNet** (1.2M images, 1000 classes).
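
Each architecture in `keras.applications` comes with a matching `preprocess_input` function, and the pre-trained weights expect inputs in that format. The cells in this notebook scale pixels to [0, 1] instead, which keeps the code simple but may not match what the weights were trained on. A small sketch on a dummy batch (random pixels, not real images) shows what two of these functions do:

```python
import numpy as np
from tensorflow.keras.applications import vgg16, mobilenet_v2

# Dummy batch of two 32x32 RGB images in the raw 0-255 range
rng = np.random.default_rng(0)
batch = rng.integers(0, 256, size=(2, 32, 32, 3)).astype("float32")

# preprocess_input can modify NumPy arrays in place, so pass copies
vgg_ready = vgg16.preprocess_input(batch.copy())
mobilenet_ready = mobilenet_v2.preprocess_input(batch.copy())

# VGG16: RGB -> BGR plus per-channel mean subtraction, no rescaling
print(f"VGG16 range:       [{vgg_ready.min():.1f}, {vgg_ready.max():.1f}]")
# MobileNetV2: pixels rescaled to [-1, 1]
print(f"MobileNetV2 range: [{mobilenet_ready.min():.1f}, {mobilenet_ready.max():.1f}]")
```

Both functions expect raw 0-255 pixel values, so they would be applied instead of dividing by 255, not after it.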

In [None]:
# Load pre-trained VGG16 model
# include_top=False removes the fully connected classification head
# weights='imagenet' loads ImageNet pre-trained weights
base_model = VGG16(
    include_top=False,
    weights='imagenet',
    input_shape=(32, 32, 3)
)

print("Pre-trained VGG16 Base Model:")
print(f"Number of layers: {len(base_model.layers)}")
print(f"Total parameters: {base_model.count_params():,}")
print(f"Output shape: {base_model.output_shape}")

## 6. Transfer Learning Approach 1: Feature Extraction

**Feature Extraction** means:
1. Freeze all pre-trained layers (don't update their weights)
2. Add new trainable layers on top
3. Train only the new layers

**When to use**: 
- Small dataset (< 10,000 samples)
- Images similar to those in ImageNet (natural photographs)
- Limited computational resources
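
The three steps can be checked on a toy example. The sketch below uses a tiny stand-in base (`toy_base`, a hypothetical replacement for a real pre-trained network) and random data, and verifies that training leaves the frozen weights untouched while the new head changes.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Tiny stand-in for a pre-trained base (hypothetical; a real base would be e.g. VGG16)
toy_base = keras.Sequential([
    keras.Input(shape=(8, 8, 3)),
    layers.Conv2D(4, 3, activation="relu", padding="same"),
], name="toy_base")
toy_base.trainable = False  # Step 1: freeze the base

# Step 2: add a new trainable head on top
model = keras.Sequential([
    keras.Input(shape=(8, 8, 3)),
    toy_base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Random data, used only to run a few gradient updates
rng = np.random.default_rng(0)
X = rng.random((32, 8, 8, 3)).astype("float32")
y = rng.integers(0, 2, size=32)

base_before = toy_base.get_weights()
head_before = model.layers[-1].get_weights()

# Step 3: train; only the head receives updates
model.fit(X, y, epochs=2, batch_size=8, verbose=0)

base_unchanged = all(
    np.array_equal(a, b) for a, b in zip(base_before, toy_base.get_weights())
)
head_changed = any(
    not np.array_equal(a, b)
    for a, b in zip(head_before, model.layers[-1].get_weights())
)
print(f"Base weights unchanged: {base_unchanged}")
print(f"Head weights changed:   {head_changed}")
```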

In [None]:
def create_feature_extraction_model(base_model):
    """
    Create transfer learning model using feature extraction.
    Base model layers are frozen.
    """
    # Freeze all layers in base model
    base_model.trainable = False
    
    # Add custom classification head
    model = keras.Sequential([
        base_model,
        layers.GlobalAveragePooling2D(),  # Convert feature maps to single vector
        layers.Dense(256, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(10, activation='softmax')
    ])
    
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    
    return model

# Create feature extraction model
feature_extraction_model = create_feature_extraction_model(base_model)

# Check trainable parameters
print("Feature Extraction Model:")
print(f"Total parameters: {feature_extraction_model.count_params():,}")
print(f"Trainable parameters: {sum([tf.size(w).numpy() for w in feature_extraction_model.trainable_weights]):,}")
print(f"Non-trainable parameters: {sum([tf.size(w).numpy() for w in feature_extraction_model.non_trainable_weights]):,}")

In [None]:
# Train feature extraction model
print("Training feature extraction model...")
feature_history = feature_extraction_model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=20,
    batch_size=32,
    verbose=0
)

# Evaluate on test set
feature_test_loss, feature_test_acc = feature_extraction_model.evaluate(X_test, y_test, verbose=0)
print(f"\nFeature Extraction Test Accuracy: {feature_test_acc:.4f}")
print(f"Improvement over baseline: {(feature_test_acc - baseline_test_acc) * 100:+.2f} percentage points")

## 7. Transfer Learning Approach 2: Fine-Tuning

**Fine-Tuning** means:
1. Start with pre-trained weights
2. Freeze early layers (generic features)
3. Unfreeze later layers (task-specific features)
4. Train with a small learning rate

**When to use**:
- Medium dataset (10,000 - 100,000 samples)
- Want to maximize performance
- Have sufficient computational resources

**Best Practice**: Fine-tune in two stages:
1. Train top layers with base frozen
2. Unfreeze some base layers and train with low LR
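
The unfreezing step in stage 2 can be wrapped in a small helper. This is a sketch, not the notebook's own method: `unfreeze_last_layers` is a hypothetical name, the toy base stands in for a real pre-trained network, and keeping `BatchNormalization` layers frozen is an added precaution that matters for bases such as ResNet50 or MobileNetV2 (VGG16 has no such layers).

```python
from tensorflow import keras
from tensorflow.keras import layers

def unfreeze_last_layers(base, n_layers):
    """Make only the last `n_layers` of `base` trainable.

    BatchNormalization layers are kept frozen so they keep using their
    stored statistics during fine-tuning (an assumption of this sketch).
    Returns the names of the layers left trainable.
    """
    base.trainable = True
    cutoff = len(base.layers) - n_layers
    for i, layer in enumerate(base.layers):
        if i < cutoff or isinstance(layer, layers.BatchNormalization):
            layer.trainable = False
    return [layer.name for layer in base.layers if layer.trainable]

# Toy base with named layers (hypothetical stand-in for a pre-trained model)
toy_base = keras.Sequential([
    keras.Input(shape=(8, 8, 3)),
    layers.Conv2D(4, 3, padding="same", name="conv_a"),
    layers.BatchNormalization(name="bn_a"),
    layers.Conv2D(8, 3, padding="same", name="conv_b"),
    layers.BatchNormalization(name="bn_b"),
    layers.Conv2D(8, 3, padding="same", name="conv_c"),
], name="toy_base")

trainable_names = unfreeze_last_layers(toy_base, n_layers=3)
print(trainable_names)
```

After changing `trainable`, recompile the full model (for example with `keras.optimizers.Adam(learning_rate=1e-5)`) so the change takes effect.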

In [None]:
# Create a fresh base model for fine-tuning
base_model_finetune = VGG16(
    include_top=False,
    weights='imagenet',
    input_shape=(32, 32, 3)
)

# Initially freeze all layers
base_model_finetune.trainable = False

# Build model with custom head
finetune_model = keras.Sequential([
    base_model_finetune,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax')
])

finetune_model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Stage 1: Train top layers only
print("Stage 1: Training custom layers...")
finetune_model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=10,
    batch_size=32,
    verbose=0
)

print("Stage 1 complete.")

In [None]:
# Stage 2: Unfreeze last few layers and fine-tune
print("\nStage 2: Fine-tuning last layers...")

# Unfreeze the base model
base_model_finetune.trainable = True

# Freeze all layers except the last 4 (for VGG16: block5_conv1-3 and block5_pool)
for layer in base_model_finetune.layers[:-4]:
    layer.trainable = False

# Check which layers are trainable
print("Trainable layers:")
for i, layer in enumerate(base_model_finetune.layers):
    if layer.trainable:
        print(f"  Layer {i}: {layer.name}")

# Recompile with lower learning rate
# Lower LR is critical for fine-tuning to avoid destroying pre-trained weights
finetune_model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-5),  # Much smaller LR!
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Fine-tune
finetune_history = finetune_model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=10,
    batch_size=32,
    verbose=0
)

# Evaluate
finetune_test_loss, finetune_test_acc = finetune_model.evaluate(X_test, y_test, verbose=0)
print(f"\nFine-tuned Test Accuracy: {finetune_test_acc:.4f}")
print(f"Improvement over baseline: {(finetune_test_acc - baseline_test_acc) * 100:+.2f} percentage points")
print(f"Improvement over feature extraction: {(finetune_test_acc - feature_test_acc) * 100:+.2f} percentage points")

## 8. Comparing All Approaches

In [None]:
# Create comparison visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot training accuracy
axes[0].plot(baseline_history.history['accuracy'], label='Baseline', linewidth=2)
axes[0].plot(feature_history.history['accuracy'], label='Feature Extraction', linewidth=2)
# Fine-tuning history covers stage 2 only (10 epochs)
axes[0].plot(finetune_history.history['accuracy'], label='Fine-Tuning (stage 2)', linewidth=2)
axes[0].set_xlabel('Epoch', fontsize=11)
axes[0].set_ylabel('Training Accuracy', fontsize=11)
axes[0].set_title('Training Accuracy Comparison', fontsize=12, fontweight='bold')
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

# Plot validation accuracy
axes[1].plot(baseline_history.history['val_accuracy'], label='Baseline', linewidth=2)
axes[1].plot(feature_history.history['val_accuracy'], label='Feature Extraction', linewidth=2)
axes[1].plot(finetune_history.history['val_accuracy'], label='Fine-Tuning (stage 2)', linewidth=2)
axes[1].set_xlabel('Epoch', fontsize=11)
axes[1].set_ylabel('Validation Accuracy', fontsize=11)
axes[1].set_title('Validation Accuracy Comparison', fontsize=12, fontweight='bold')
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Test accuracy comparison
models = ['Baseline\n(From Scratch)', 'Feature\nExtraction', 'Fine-Tuning']
accuracies = [baseline_test_acc, feature_test_acc, finetune_test_acc]

plt.figure(figsize=(10, 6))
bars = plt.bar(models, accuracies, color=['#e74c3c', '#3498db', '#2ecc71'], alpha=0.8)
plt.ylabel('Test Accuracy', fontsize=12)
plt.title('Transfer Learning Methods Comparison', fontsize=14, fontweight='bold')
plt.ylim([0, 1])

# Add value labels on bars
for bar, acc in zip(bars, accuracies):
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
             f'{acc:.4f}',
             ha='center', va='bottom', fontsize=11, fontweight='bold')

plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print("\nSummary:")
print(f"Baseline (from scratch):  {baseline_test_acc:.4f}")
print(f"Feature Extraction:       {feature_test_acc:.4f} ({(feature_test_acc - baseline_test_acc) * 100:+.2f} pp)")
print(f"Fine-Tuning:              {finetune_test_acc:.4f} ({(finetune_test_acc - baseline_test_acc) * 100:+.2f} pp)")

## 9. Using Different Pre-trained Models

Let's compare different pre-trained architectures.
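
These architectures were designed for much larger inputs (224×224 by default in Keras), so 32×32 images leave very little spatial resolution by the last layers. One option, not used in the cells of this notebook, is to upsample inside the model. The sketch below assumes a 96×96 target size and `weights=None` (to avoid a download here; in practice you would load `weights='imagenet'`).

```python
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.applications import MobileNetV2

# Assumption: weights=None avoids a download in this sketch
base = MobileNetV2(include_top=False, weights=None, input_shape=(96, 96, 3))
base.trainable = False

inputs = keras.Input(shape=(32, 32, 3))
# Upsample 32x32 images to the 96x96 size the base was built for
x = layers.Resizing(96, 96, interpolation="bilinear")(inputs)
# training=False keeps BatchNormalization layers in inference mode
x = base(x, training=False)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(10, activation="softmax")(x)
model = keras.Model(inputs, outputs)

print(f"Base feature map shape: {tuple(base.output_shape[1:])}")
print(f"Model output shape:     {tuple(model.output_shape)}")
```

Larger inputs cost more compute per image, so the target size is a trade-off between accuracy and training time.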

In [None]:
def create_transfer_model(base_model_name='VGG16'):
    """
    Create transfer learning model with specified base.
    
    Args:
        base_model_name: Name of pre-trained model to use
    
    Returns:
        Compiled Keras model
    """
    # Select base model
    if base_model_name == 'VGG16':
        base = VGG16(include_top=False, weights='imagenet', input_shape=(32, 32, 3))
    elif base_model_name == 'ResNet50':
        base = ResNet50(include_top=False, weights='imagenet', input_shape=(32, 32, 3))
    elif base_model_name == 'MobileNetV2':
        base = MobileNetV2(include_top=False, weights='imagenet', input_shape=(32, 32, 3))
    else:
        raise ValueError(f"Unknown base model: {base_model_name}")
    
    # Freeze base
    base.trainable = False
    
    # Build model
    model = keras.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dense(128, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(10, activation='softmax')
    ])
    
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    
    return model

# Compare different architectures
architectures = ['VGG16', 'ResNet50', 'MobileNetV2']
architecture_results = {}

for arch in architectures:
    print(f"\nTesting {arch}...")
    
    model = create_transfer_model(arch)
    
    # Train briefly
    history = model.fit(
        X_train, y_train,
        validation_data=(X_val, y_val),
        epochs=10,
        batch_size=32,
        verbose=0
    )
    
    # Evaluate
    test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
    
    architecture_results[arch] = {
        'test_accuracy': test_acc,
        'parameters': model.count_params()
    }
    
    print(f"  Test Accuracy: {test_acc:.4f}")
    print(f"  Parameters: {model.count_params():,}")

In [None]:
# Visualize architecture comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy comparison
arch_names = list(architecture_results.keys())
accuracies = [architecture_results[arch]['test_accuracy'] for arch in arch_names]

axes[0].bar(arch_names, accuracies, color=['#3498db', '#e74c3c', '#2ecc71'], alpha=0.8)
axes[0].set_ylabel('Test Accuracy', fontsize=11)
axes[0].set_title('Accuracy by Architecture', fontsize=12, fontweight='bold')
axes[0].set_ylim([0, 1])
axes[0].grid(True, alpha=0.3, axis='y')

# Add value labels
for i, (name, acc) in enumerate(zip(arch_names, accuracies)):
    axes[0].text(i, acc, f'{acc:.3f}', ha='center', va='bottom', fontweight='bold')

# Parameter count comparison
params = [architecture_results[arch]['parameters'] / 1e6 for arch in arch_names]  # In millions

axes[1].bar(arch_names, params, color=['#3498db', '#e74c3c', '#2ecc71'], alpha=0.8)
axes[1].set_ylabel('Parameters (Millions)', fontsize=11)
axes[1].set_title('Model Size by Architecture', fontsize=12, fontweight='bold')
axes[1].grid(True, alpha=0.3, axis='y')

# Add value labels
for i, (name, param) in enumerate(zip(arch_names, params)):
    axes[1].text(i, param, f'{param:.1f}M', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

## 10. When to Use Transfer Learning: Decision Guide

### ✅ Use Transfer Learning When:

1. **Limited Training Data**
   - You have < 10,000 labeled examples
   - Collecting more data is expensive or time-consuming

2. **Similar Domain**
   - Your images are natural images (like ImageNet)
   - Tasks like object recognition, scene classification

3. **Resource Constraints**
   - Limited computational budget
   - Need faster training times

4. **Quick Prototyping**
   - You want to establish a baseline quickly
   - You are exploring the feasibility of a deep learning approach

### ❌ Consider Training from Scratch When:

1. **Very Different Domain**
   - Medical images (X-rays, MRI)
   - Satellite imagery
   - Specialized scientific images

2. **Massive Dataset Available**
   - You have > 1M labeled examples
   - Enough data to learn from scratch

3. **Completely Different Task**
   - Image generation (GANs)
   - Super-resolution
   - Novel architectures needed

### Feature Extraction vs Fine-Tuning Decision:

| Criterion | Feature Extraction | Fine-Tuning |
|-----------|-------------------|-------------|
| **Dataset Size** | < 10,000 | > 10,000 |
| **Similarity to ImageNet** | Very similar | Somewhat similar |
| **Computational Budget** | Low | Medium-High |
| **Training Time** | Fast | Slower |
| **Peak Accuracy** | Good | Best |
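
The dataset-size thresholds from this section can be written as a small helper. This is a rough sketch: `suggest_strategy` is a hypothetical name, the cut-offs are the rules of thumb given above rather than hard limits, and treating exactly 10,000 samples as the fine-tuning case is an assumption.

```python
def suggest_strategy(n_samples):
    """Suggest a training strategy from dataset size alone.

    Thresholds follow the rules of thumb in this section; domain
    similarity and compute budget should also be weighed.
    """
    if n_samples > 1_000_000:
        return "consider training from scratch"
    if n_samples < 10_000:
        return "feature extraction"
    return "fine-tuning"

for n in (800, 50_000, 2_000_000):
    print(f"{n:>9,} samples -> {suggest_strategy(n)}")
```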

## 11. Exercise 1: Transfer Learning with Your Own Data

**Task**: Apply transfer learning to a binary classification problem.

**Requirements**:
1. Select 2 classes from CIFAR-10 (e.g., cats vs dogs)
2. Create a small dataset (200 samples per class)
3. Build both feature extraction and fine-tuning models
4. Compare performance with baseline model
5. Visualize training curves and final results

In [None]:
# YOUR CODE HERE
# Hint: Filter CIFAR-10 to keep only 2 classes
# Use np.isin() to filter labels

pass  # Replace with your implementation

## 12. Exercise 2: Layer Freezing Strategy

**Task**: Experiment with different layer freezing strategies.

**Requirements**:
1. Load a pre-trained model (VGG16 or ResNet50)
2. Try freezing different numbers of layers:
   - All layers frozen (feature extraction)
   - Last 25% unfrozen
   - Last 50% unfrozen
   - All layers unfrozen
3. Train each variant for the same number of epochs
4. Compare accuracy vs training time
5. Determine the optimal freezing strategy for this dataset

In [None]:
# YOUR CODE HERE
# Hint: Use a loop to iterate over different freeze percentages
# Track both accuracy and training time

pass  # Replace with your implementation

## 13. Exercise 3: Pre-trained Model Comparison

**Task**: Compare at least 3 different pre-trained models systematically.

**Requirements**:
1. Select 3 models from keras.applications (e.g., VGG16, ResNet50, MobileNetV2, InceptionV3)
2. For each model:
   - Record number of parameters
   - Measure training time per epoch
   - Record best validation accuracy
   - Measure inference time (prediction speed)
3. Create a comprehensive comparison table
4. Visualize the trade-offs between accuracy, size, and speed
5. Make a recommendation based on different scenarios (accuracy vs speed vs size)

In [None]:
# YOUR CODE HERE
# Hint: Use time.time() to measure training and inference time
# Create scatter plots showing trade-offs

pass  # Replace with your implementation

## 14. Summary

### Key Concepts Covered:

1. **Transfer Learning Fundamentals**
   - Using pre-trained models for new tasks
   - Why it works: hierarchical feature learning
   - Benefits: less data, faster training, better performance

2. **Feature Extraction**
   - Freeze pre-trained layers completely
   - Train only custom classification head
   - Best for small datasets and quick prototyping

3. **Fine-Tuning**
   - Unfreeze later layers and train with low learning rate
   - Two-stage process for best results
   - Often achieves the highest accuracy but needs more data to avoid overfitting

4. **Pre-trained Model Zoo**
   - VGG: Simple, large
   - ResNet: Deep with skip connections
   - MobileNet: Lightweight and efficient
   - Choose based on accuracy vs speed vs size trade-offs

5. **Decision Framework**
   - When to use transfer learning
   - Feature extraction vs fine-tuning choice
   - Model architecture selection

### Best Practices:

- Start with feature extraction as a baseline
- Use very low learning rates for fine-tuning (1e-5 to 1e-4)
- Freeze early layers, unfreeze later layers
- Monitor validation metrics to prevent overfitting
- Consider model size for deployment constraints
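
For monitoring validation metrics, one common tool is the `EarlyStopping` callback. The sketch below uses a tiny dense model and random data purely to show the wiring; the `patience` value is an arbitrary choice, not a recommendation from this module.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Random data, used only to demonstrate the callback
rng = np.random.default_rng(0)
X = rng.random((64, 16)).astype("float32")
y = rng.integers(0, 2, size=64)

model = keras.Sequential([
    keras.Input(shape=(16,)),
    layers.Dense(8, activation="relu"),
    layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Stop when validation loss has not improved for 3 epochs,
# then restore the weights from the best epoch
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True
)

max_epochs = 50
history = model.fit(
    X, y,
    validation_split=0.25,
    epochs=max_epochs,
    batch_size=16,
    callbacks=[early_stop],
    verbose=0,
)
epochs_run = len(history.history["loss"])
print(f"Ran {epochs_run} of {max_epochs} epochs")
```

The same callback can be passed to any of the `fit` calls in this notebook through the `callbacks` argument.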

### What's Next?

- [Module 11: Debugging Neural Networks](11_debugging_neural_networks.ipynb)
- Advanced: Domain adaptation, multi-task learning
- Specialized models: Object detection (YOLO), Segmentation (U-Net)

### Additional Resources:

1. Keras Applications documentation: https://keras.io/api/applications/
2. "How transferable are features in deep neural networks?" (Yosinski et al., 2014)
3. ImageNet website: http://www.image-net.org/
4. Transfer Learning Guide: https://www.tensorflow.org/tutorials/images/transfer_learning