# Lesson 4: ResNet50 Transfer Learning for Flower Classification

## Overview
Learn transfer learning with ResNet50 on the Flowers102 dataset. This lesson demonstrates how deeper networks can improve performance and compares results with ResNet18 from Lesson 3.

### Learning Objectives
- Understand ResNet50 architecture and bottleneck blocks
- Implement transfer learning with a deeper network
- Compare performance between ResNet18 and ResNet50
- Analyze computational trade-offs between model depth and performance

### Model Quick Facts
- **Architecture**: ResNet50 (50 layers, 25.6M parameters)
- **Pre-training**: ImageNet dataset (1.2M images, 1000 classes)
- **Key Innovation**: Bottleneck blocks for efficient deep networks
- **Transfer Method**: Feature extraction + fine-tuning
- **Expected Performance**: ~88%+ accuracy on Flowers102 (vs ~85% for ResNet18)


## Step 1: Environment Setup and Library Imports

### Why This Step Matters
Setting up the environment correctly is crucial for:
- **Reproducibility**: Ensuring consistent results across different runs
- **Performance**: Optimizing GPU usage and memory management
- **Debugging**: Clean output without unnecessary warnings

### Key Libraries Explained
- **torch**: Core PyTorch library (tensors, automatic differentiation, neural networks)
- **torchvision**: Computer vision utilities (datasets, transforms, pre-trained models)
- **models**: Pre-trained model architectures (ResNet50, ResNet18, etc.)
- **optim**: Optimization algorithms (SGD, Adam, AdamW)
- **DataLoader**: Efficient batch processing and parallel data loading
- **tqdm**: Progress bars for training loops
- **matplotlib**: Data visualization and plotting
- **sklearn**: Machine learning utilities (metrics, confusion matrix)

### Configuration Settings
We configure matplotlib for high-quality visualizations and set up proper warning filters for cleaner output during training.


In [None]:
# Core PyTorch libraries
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

# Computer vision utilities
import torchvision
import torchvision.transforms as transforms
from torchvision import models

# Data handling and visualization
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
import time
import copy

# Machine learning utilities
from sklearn.metrics import confusion_matrix, classification_report
import warnings
warnings.filterwarnings('ignore')

# Configure matplotlib for high-quality plots
plt.rcParams['figure.dpi'] = 100
plt.rcParams['font.size'] = 10
plt.style.use('default')

print("✅ Libraries imported successfully!")
print(f"📦 PyTorch version: {torch.__version__}")
print(f"🖼️ Torchvision version: {torchvision.__version__}")
print(f"🔥 CUDA available: {torch.cuda.is_available()}")
print(f"🍎 MPS available: {torch.backends.mps.is_available()}")


## Step 2: Device Detection and Configuration

### Device Selection Strategy
ResNet50 requires more computational resources than ResNet18. Our device detection follows this priority:

1. **CUDA GPU** (NVIDIA): Highly recommended for ResNet50 training
   - Parallel processing with thousands of cores
   - Large memory capacity for deep networks
   - Optimized for matrix operations

2. **MPS (Apple Silicon)**: Apple's Metal Performance Shaders
   - Efficient on M1/M2 chips
   - May need batch size reduction for memory constraints
   - Good performance for development

3. **CPU**: Not recommended for ResNet50
   - Very slow training (hours instead of minutes)
   - Use only for testing/debugging

### Training Configuration
We use the same parameters as ResNet18 for fair comparison:
- **Batch Size**: 32 (may need reduction to 16 for memory limits)
- **Learning Rate**: 0.001 (standard for AdamW optimizer)
- **Epochs**: 50 total (20 frozen + 30 fine-tuning)
- **Optimizer**: AdamW with weight decay
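
The 50-epoch budget above splits into 20 frozen epochs followed by 30 fine-tuning epochs. A minimal sketch of that schedule, where `phase_for_epoch` is a hypothetical helper that is not part of the lesson code:

```python
def phase_for_epoch(epoch, freeze_epochs=20, finetune_epochs=30):
    """Map a 0-based global epoch index to its training phase."""
    total = freeze_epochs + finetune_epochs
    if not 0 <= epoch < total:
        raise ValueError(f"epoch must be in [0, {total}), got {epoch}")
    return "feature_extraction" if epoch < freeze_epochs else "fine_tuning"


# Epochs 0-19 train only the classifier; epochs 20-49 train the whole network.
for epoch in (0, 19, 20, 49):
    print(epoch, phase_for_epoch(epoch))
```

The lesson itself runs the two phases as two separate loops; this helper only makes the epoch boundaries explicit.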


In [None]:
# Device detection with fallback hierarchy
print("🔍 Detecting optimal compute device...")

if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"🚀 Using NVIDIA GPU: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    print("   💡 A CUDA GPU is the recommended device for ResNet50")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print("🍎 Using Apple Silicon GPU (MPS)")
    print("   Optimized for M1/M2 chips")
    print("   ⚠️  May need batch size reduction for ResNet50")
else:
    device = torch.device("cpu")
    print("💻 Using CPU")
    print("   ⚠️  NOT recommended for ResNet50 - very slow training")

# Set training configuration
print("\n⚙️ Setting up training configuration...")
config = {
    'batch_size': 32,  # May need reduction for memory limits
    'learning_rate': 0.001,
    'epochs': 50,
    'freeze_epochs': 20,
    'finetune_epochs': 30,
    'num_workers': 2,
    'weight_decay': 0.01
}

print(f"   📦 Batch size: {config['batch_size']} (reduce to 16 if memory issues)")
print(f"   🎯 Learning rate: {config['learning_rate']}")
print(f"   🔄 Total epochs: {config['epochs']} (freeze: {config['freeze_epochs']}, fine-tune: {config['finetune_epochs']})")
print(f"   👥 Workers: {config['num_workers']}")
print(f"   ⚖️ Weight decay: {config['weight_decay']}")

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed(42)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # benchmark mode can pick different kernels between runs

print("\n✅ Configuration complete!")
print("💡 If you encounter memory issues, reduce batch_size to 16 or 8")


## Step 3: Data Preprocessing and DataLoader Setup

### Data Augmentation Strategy

**Why Augmentation is Critical for ResNet50:**
- **Prevents Overfitting**: Deeper networks are more prone to overfitting
- **Increases Effective Dataset Size**: More parameters need more data variations
- **Improves Generalization**: Helps the model handle real-world variations
- **Maximizes Transfer Learning**: Augmentation helps adaptation to new domain

**Training vs. Validation Transforms:**
- **Training**: Aggressive augmentation for robustness
- **Validation/Test**: Minimal transforms for consistent evaluation

### ImageNet Normalization
Critical for pre-trained models: the ImageNet-trained weights expect inputs normalized with the ImageNet channel statistics:
- **Mean**: [0.485, 0.456, 0.406] for RGB channels
- **Std**: [0.229, 0.224, 0.225] for RGB channels
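
As a small worked example of what this normalization does to a single pixel, the sketch below applies `(value - mean) / std` per channel in plain Python. It assumes pixel values already scaled to [0, 1], which is what `transforms.ToTensor()` produces; `normalize_pixel` is an illustrative helper, not a torchvision function.

```python
# ImageNet channel statistics quoted above (RGB order).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Apply (value - mean) / std per channel to an RGB pixel scaled to [0, 1]."""
    return tuple((v - m) / s for v, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))


# A pixel equal to the channel means maps to zero in every channel.
print(normalize_pixel(IMAGENET_MEAN))

# Pure white lands a little over two standard deviations above the mean.
print([round(c, 3) for c in normalize_pixel((1.0, 1.0, 1.0))])
```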

### Memory Considerations
ResNet50 uses more memory than ResNet18:
- **Batch Size**: May need reduction from 32 to 16 or 8
- **Workers**: Monitor CPU usage during data loading
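
Halving the batch size roughly doubles the number of iterations per epoch. The sketch below works this out for the Flowers102 split sizes (1,020 train, 1,020 validation, 6,149 test), assuming the `DataLoader` default of `drop_last=False`; `batches_per_epoch` is an illustrative helper.

```python
import math

# Flowers102 split sizes (10 images per class for train and val, the rest for test).
SPLIT_SIZES = {"train": 1020, "val": 1020, "test": 6149}


def batches_per_epoch(num_samples, batch_size):
    """Batches yielded per epoch when the last partial batch is kept."""
    return math.ceil(num_samples / batch_size)


for batch_size in (32, 16, 8):
    counts = {split: batches_per_epoch(n, batch_size) for split, n in SPLIT_SIZES.items()}
    print(f"batch_size={batch_size}: {counts}")
```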


In [None]:
print("🔧 Creating data preprocessing pipeline...")

# Training transforms with augmentation
train_transforms = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Validation transforms (no augmentation)
val_transforms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

print("   ✓ Training transforms: 4 random augmentations (crop, flip, rotation, color jitter) + ImageNet normalization")
print("   ✓ Validation transforms: resize + ImageNet normalization only")

# Create datasets
print("\n📦 Loading Flowers102 dataset...")
try:
    train_dataset = torchvision.datasets.Flowers102(
        root='./data', split='train', transform=train_transforms, download=True)
    val_dataset = torchvision.datasets.Flowers102(
        root='./data', split='val', transform=val_transforms, download=True)
    test_dataset = torchvision.datasets.Flowers102(
        root='./data', split='test', transform=val_transforms, download=True)
    
    print(f"   🏋️ Training samples: {len(train_dataset):,}")
    print(f"   🔍 Validation samples: {len(val_dataset):,}")
    print(f"   📝 Test samples: {len(test_dataset):,}")
    
except Exception as e:
    print(f"   ❌ Error loading dataset: {e}")
    print("   💡 Make sure you have internet connection for first download")

# Create DataLoaders with memory monitoring
print("\n⚙️ Setting up DataLoaders...")
try:
    train_loader = DataLoader(train_dataset, batch_size=config['batch_size'], 
                             shuffle=True, num_workers=config['num_workers'], pin_memory=True)
    val_loader = DataLoader(val_dataset, batch_size=config['batch_size'], 
                           shuffle=False, num_workers=config['num_workers'], pin_memory=True)
    test_loader = DataLoader(test_dataset, batch_size=config['batch_size'], 
                            shuffle=False, num_workers=config['num_workers'], pin_memory=True)
    
    print(f"   📊 DataLoader batches: {len(train_loader)} train, {len(val_loader)} val, {len(test_loader)} test")
    print("   ✅ Data pipeline ready!")
    
except Exception as e:
    print(f"   ❌ Error creating DataLoaders: {e}")
    print("   💡 Try reducing batch_size or num_workers")
    print("   💡 Suggested fix: config['batch_size'] = 16")


## Step 4: ResNet50 Model Setup and Architecture Analysis

### ResNet50 vs ResNet18 Comparison

| Feature | ResNet18 | ResNet50 | Impact |
|---------|----------|----------|---------|
| **Layers** | 18 | 50 | 2.8× deeper |
| **Parameters** | 11.7M | 25.6M | 2.2× more |
| **Model Size** | ~47MB | ~102MB | 2.2× larger |
| **Memory Usage** | ~2GB | ~3-4GB | 1.5-2× more |
| **Training Time** | 15-20 min | 25-35 min | 1.5-2× slower |

### Bottleneck Block Innovation
ResNet50 uses bottleneck blocks instead of basic blocks:
- **1×1 Conv**: Reduces channels for efficiency
- **3×3 Conv**: Processes features with reduced channels
- **1×1 Conv**: Expands channels back to the block's output width (4× the bottleneck width)
- **Skip Connection**: Enables deep network training
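
To see why the bottleneck design is cheaper, the sketch below counts convolution weights for one block at a 256-channel width. It is a simplified count under stated assumptions: BatchNorm parameters and downsampling shortcuts are ignored, the convolutions carry no bias (as in ResNet), and the helper names are illustrative.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weights in a bias-free 2D convolution."""
    return in_channels * out_channels * kernel_size * kernel_size


def basic_block_params(channels):
    """Two 3x3 convolutions at full width (ResNet18-style basic block)."""
    return 2 * conv_params(channels, channels, 3)


def bottleneck_params(channels, reduction=4):
    """1x1 reduce, 3x3 at the reduced width, then 1x1 expand (ResNet50-style block)."""
    mid = channels // reduction
    return (
        conv_params(channels, mid, 1)
        + conv_params(mid, mid, 3)
        + conv_params(mid, channels, 1)
    )


basic = basic_block_params(256)
bottleneck = bottleneck_params(256)
print(f"basic block:      {basic:,} weights")
print(f"bottleneck block: {bottleneck:,} weights")
print(f"ratio:            {basic / bottleneck:.1f}x")
```

Because each bottleneck block is so much cheaper at a given width, ResNet50 can stack more layers and use wider feature maps than ResNet18 while staying at roughly twice the parameter count.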

### Transfer Learning Advantages
ResNet50's depth provides:
- **Richer Feature Hierarchy**: More complex pattern recognition
- **Better Generalization**: Proven performance on diverse tasks
- **Stable Training**: Residual connections prevent vanishing gradients


In [None]:
print("🏗️ Setting up ResNet50 model...")

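# Load pre-trained ResNet50
# torchvision >= 0.13 replaces the deprecated `pretrained=True` flag with `weights=`;
# IMAGENET1K_V1 is the checkpoint that `pretrained=True` used to load.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
print(f"   ✓ Loaded pre-trained ResNet50")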
print(f"   📊 Original final layer: {model.fc.in_features} → 1000 classes")

# Modify final layer for Flowers102 (102 classes)
num_classes = 102
model.fc = nn.Linear(model.fc.in_features, num_classes)
print(f"   🎯 Modified final layer: {model.fc.in_features} → {num_classes} classes")

# Move model to device
model = model.to(device)
print(f"   🚀 Model moved to {device}")

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"   📈 Total parameters: {total_params:,}")
print(f"   🎯 Trainable parameters: {trainable_params:,}")
print(f"   📊 Model size: {total_params * 4 / 1e6:.1f} MB (float32)")

# Compare with ResNet18
resnet18_params = 11_689_512  # torchvision ResNet18 with its original 1000-class head
print(f"\n📊 ResNet50 vs ResNet18 comparison:")
print(f"   📈 Parameter ratio: {total_params / resnet18_params:.1f}× more parameters")
print(f"   💾 Weight storage ratio: {total_params / resnet18_params:.1f}× (activation memory scales differently)")

# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=config['learning_rate'], weight_decay=config['weight_decay'])

print(f"\n⚙️ Training setup:")
print(f"   🎯 Loss function: CrossEntropyLoss")
print(f"   🚀 Optimizer: AdamW (lr={config['learning_rate']}, weight_decay={config['weight_decay']})")

# Function to freeze/unfreeze model parameters
def set_parameter_requires_grad(model, feature_extracting):
    if feature_extracting:
        for param in model.parameters():
            param.requires_grad = False
        # Only train the classifier
        for param in model.fc.parameters():
            param.requires_grad = True
    else:
        for param in model.parameters():
            param.requires_grad = True

print("✅ ResNet50 model setup complete!")
print("💡 Ready for two-phase training: feature extraction → fine-tuning")


## Step 5: Training and Evaluation Functions

### Function Design for Deep Networks
Our training functions are optimized for deeper networks like ResNet50:
- **Memory Management**: Efficient GPU memory usage
- **Progress Monitoring**: Real-time loss and accuracy tracking
- **Error Handling**: Graceful handling of memory issues
- **Performance Metrics**: Comprehensive evaluation
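
One detail of metric tracking is worth a sketch. Averaging per-batch mean losses over the number of batches matches the per-sample average only when every batch has the same size; a short final batch gets over-weighted. The `RunningAverage` class below is a hypothetical helper showing the sample-weighted alternative, not the lesson's own implementation.

```python
class RunningAverage:
    """Sample-weighted running mean of a per-batch metric."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, batch_mean, batch_size):
        # Convert the batch mean back to a sum so every sample counts equally.
        self.total += batch_mean * batch_size
        self.count += batch_size

    @property
    def value(self):
        return self.total / self.count if self.count else 0.0


# Three batches: two full ones and a short final batch with a high loss.
batch_losses = [1.0, 1.0, 4.0]
batch_sizes = [32, 32, 4]

tracker = RunningAverage()
for loss, size in zip(batch_losses, batch_sizes):
    tracker.update(loss, size)

mean_of_batch_means = sum(batch_losses) / len(batch_losses)
print(f"mean of batch means: {mean_of_batch_means:.3f}")
print(f"sample-weighted:     {tracker.value:.3f}")
```

With the batch sizes used in this lesson the difference is small, so either convention is reasonable as long as it is applied consistently across models.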

### Training Strategy
We use the same two-phase approach as ResNet18 for fair comparison:
1. **Phase 1**: Feature extraction (frozen backbone)
2. **Phase 2**: End-to-end fine-tuning (unfrozen network)

### Memory Optimization
The functions include automatic memory cleanup to handle ResNet50's higher memory usage.


In [None]:
def train_epoch(model, train_loader, criterion, optimizer, device):
    """Train model for one epoch with memory optimization"""
    model.train()
    running_loss = 0.0
    correct = 0
    total = 0
    
    progress_bar = tqdm(train_loader, desc="Training", leave=False)
    
    for batch_idx, (data, targets) in enumerate(progress_bar):
        data, targets = data.to(device), targets.to(device)
        
        optimizer.zero_grad()
        outputs = model(data)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()
        _, predicted = outputs.max(1)
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item()
        
        # Update progress bar
        progress_bar.set_postfix({
            'Loss': f'{running_loss/(batch_idx+1):.3f}',
            'Acc': f'{100.*correct/total:.2f}%'
        })
        
        # Memory cleanup for ResNet50
        del data, targets, outputs, loss
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    
    return running_loss / len(train_loader), 100. * correct / total

def evaluate(model, val_loader, criterion, device):
    """Evaluate model on validation set with memory optimization"""
    model.eval()
    val_loss = 0.0
    correct = 0
    total = 0
    
    with torch.no_grad():
        progress_bar = tqdm(val_loader, desc="Evaluating", leave=False)
        
        for batch_idx, (data, targets) in enumerate(progress_bar):
            data, targets = data.to(device), targets.to(device)
            outputs = model(data)
            loss = criterion(outputs, targets)
            
            val_loss += loss.item()
            _, predicted = outputs.max(1)
            total += targets.size(0)
            correct += predicted.eq(targets).sum().item()
            
            progress_bar.set_postfix({
                'Loss': f'{val_loss/(batch_idx+1):.3f}',
                'Acc': f'{100.*correct/total:.2f}%'
            })
            
            # Memory cleanup for ResNet50
            del data, targets, outputs, loss
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
    
    return val_loss / len(val_loader), 100. * correct / total

print("✅ Training and evaluation functions defined!")
print("💡 Functions include memory optimization for ResNet50")


## Step 6: Phase 1 - Feature Extraction Training

### Feature Extraction with ResNet50
In Phase 1, we freeze the deeper ResNet50 backbone and train only the classifier:

**Why This Works Well for ResNet50:**
- **Pre-trained Features**: 50 layers of ImageNet features are very rich
- **Computational Efficiency**: Only the new classifier head is trained, about 209K parameters (2048 × 102 weights + 102 biases) instead of the full network
- **Memory Efficiency**: Lower memory usage during backpropagation
- **Stable Learning**: Avoids disturbing learned features initially

**Expected Performance:**
- **ResNet18**: ~75% accuracy after Phase 1
- **ResNet50**: ~78% accuracy after Phase 1 (about 3 percentage points higher)

**Training Details:**
- **Frozen Parameters**: ~23.5M parameters (99.1% of the network)
- **Trainable Parameters**: ~209K parameters (final layer only)
- **Duration**: 20 epochs (same as ResNet18 for comparison)
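
The frozen and trainable counts can be worked out from the layer shapes alone. The sketch below starts from the parameter count of torchvision's ImageNet ResNet50 (25,557,032), removes the original 1000-class head, and adds the 102-class head; `linear_params` is an illustrative helper.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


IMAGENET_RESNET50_PARAMS = 25_557_032  # torchvision ResNet50 with its 1000-class head
FEATURES = 2048                        # input width of ResNet50's final layer

backbone = IMAGENET_RESNET50_PARAMS - linear_params(FEATURES, 1000)
head = linear_params(FEATURES, 102)
total = backbone + head

print(f"frozen backbone: {backbone:,} ({backbone / total:.1%})")
print(f"trainable head:  {head:,} ({head / total:.1%})")
print(f"total:           {total:,}")
```

These figures should match what the parameter-counting code in this lesson prints after the final layer is replaced.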


In [None]:
print("🎯 Phase 1: Feature Extraction Training (ResNet50)")
print("="*60)

# Freeze backbone, only train classifier
set_parameter_requires_grad(model, feature_extracting=True)
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen_params = total_params - trainable_params

print(f"   🔒 Frozen parameters: {frozen_params:,} ({frozen_params/total_params*100:.1f}%)")
print(f"   🎯 Trainable parameters: {trainable_params:,} ({trainable_params/total_params*100:.1f}%)")
print(f"   📊 Training efficiency: {total_params/trainable_params:.0f}× fewer parameters to train")

# Training tracking
train_losses = []
train_accuracies = []
val_losses = []
val_accuracies = []

print(f"\n🚀 Starting Phase 1 training ({config['freeze_epochs']} epochs)...")
print("💡 This may take longer than ResNet18 due to deeper network")
phase1_start = time.time()

best_val_acc = 0.0
best_model_wts = copy.deepcopy(model.state_dict())

try:
    for epoch in range(config['freeze_epochs']):
        epoch_start = time.time()
        
        # Training
        train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)
        
        # Validation
        val_loss, val_acc = evaluate(model, val_loader, criterion, device)
        
        # Save best model
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            best_model_wts = copy.deepcopy(model.state_dict())
        
        # Record metrics
        train_losses.append(train_loss)
        train_accuracies.append(train_acc)
        val_losses.append(val_loss)
        val_accuracies.append(val_acc)
        
        epoch_time = time.time() - epoch_start
        
        print(f"Epoch {epoch+1:2d}/{config['freeze_epochs']} | "
              f"Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.2f}% | "
              f"Val Loss: {val_loss:.4f} | Val Acc: {val_acc:.2f}% | "
              f"Time: {epoch_time:.1f}s")
        
        # Memory monitoring
        if torch.cuda.is_available():
            memory_used = torch.cuda.memory_allocated() / 1e9
            if memory_used > 0.5:  # Show if using > 0.5GB
                print(f"           GPU Memory: {memory_used:.1f}GB")

except Exception as e:
    print(f"❌ Training error: {e}")
    print("💡 Try reducing batch_size in config if memory error")

phase1_time = time.time() - phase1_start

print(f"\n📊 Phase 1 Results:")
print(f"   ⏱️  Training time: {phase1_time:.1f}s ({phase1_time/60:.1f}m)")
print(f"   🎯 Best validation accuracy: {best_val_acc:.2f}%")
print(f"   📈 Final training accuracy: {train_accuracies[-1]:.2f}%")
print(f"   📉 Final validation loss: {val_losses[-1]:.4f}")

# ResNet18 comparison (expected values)
resnet18_phase1_acc = 75.0  # Expected ResNet18 Phase 1 accuracy
improvement = best_val_acc - resnet18_phase1_acc
print(f"\n🔍 Comparison with ResNet18:")
print(f"   📊 ResNet18 Phase 1: ~{resnet18_phase1_acc:.0f}%")
print(f"   📊 ResNet50 Phase 1: {best_val_acc:.2f}%")
print(f"   🚀 Improvement: {improvement:+.1f}% (deeper network advantage)")

# Load best model weights
model.load_state_dict(best_model_wts)
print("✅ Phase 1 complete! Best model weights loaded.")


## Step 7: Phase 2 - Fine-tuning Training

### Fine-tuning ResNet50
In Phase 2, we unfreeze all layers and train the entire ResNet50 network:

**Why ResNet50 Fine-tuning is Powerful:**
- **Deep Feature Adaptation**: 50 layers can adapt to flower-specific features
- **Hierarchical Learning**: Low-level features adapt while high-level features fine-tune
- **Superior Performance**: Deeper networks typically achieve better fine-tuning results
- **Stable Training**: Residual connections enable stable deep network training

**Expected Performance:**
- **ResNet18**: ~85% accuracy after Phase 2
- **ResNet50**: ~88% accuracy after Phase 2 (about 3 percentage points higher)

**Training Details:**
- **Trainable Parameters**: All ~23.7M parameters (the 25.6M ImageNet model with its 1000-class head swapped for a 102-class head)
- **Memory Usage**: Significantly higher than Phase 1
- **Training Time**: Longer per epoch due to full network backpropagation
- **Learning Rate**: Same as ResNet18 (0.001) for fair comparison

### Memory Management
ResNet50 fine-tuning requires careful memory management:
- **GPU Memory**: ~3-4GB required (vs ~2GB for ResNet18)
- **Batch Size**: May need reduction if memory issues occur
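- **Memory Cleanup**: The training functions delete per-batch tensors and call `torch.cuda.empty_cache()`; no gradient accumulation is used, so reducing the batch size is the main lever when memory runs out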
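
A rough lower bound on why fine-tuning needs more memory than Phase 1: every trainable float32 parameter carries a gradient plus two AdamW moment buffers on top of the weight itself. The sketch below estimates that state only. Activations, which usually dominate and scale with batch size, are deliberately not modeled, and `training_state_mb` is an illustrative helper.

```python
BYTES_PER_FLOAT32 = 4


def training_state_mb(total_params, trainable_params, optimizer_slots=2):
    """Weights for all parameters, plus gradients and optimizer slots for trainable ones."""
    weights = total_params * BYTES_PER_FLOAT32
    grads_and_slots = trainable_params * BYTES_PER_FLOAT32 * (1 + optimizer_slots)
    return (weights + grads_and_slots) / 1e6


TOTAL = 23_717_030  # ResNet50 backbone plus a 102-class head
HEAD = 208_998      # the 102-class head alone

phase1_mb = training_state_mb(TOTAL, HEAD)   # frozen backbone
phase2_mb = training_state_mb(TOTAL, TOTAL)  # everything trainable

print(f"Phase 1 parameter state: {phase1_mb:.1f} MB")
print(f"Phase 2 parameter state: {phase2_mb:.1f} MB")
```

The gap between these figures and the multi-gigabyte usage observed in practice comes from activations stored for backpropagation, which is why batch size has such a strong effect.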


In [None]:
print("🔥 Phase 2: Fine-tuning Training (ResNet50)")
print("="*60)

# Unfreeze all layers for fine-tuning
set_parameter_requires_grad(model, feature_extracting=False)
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"   🔓 All parameters unfrozen")
print(f"   🎯 Trainable parameters: {trainable_params:,} (100% of network)")
print(f"   📊 Full network training: {trainable_params/1e6:.1f}M parameters")

# Create new optimizer for fine-tuning
optimizer_ft = optim.AdamW(model.parameters(), lr=config['learning_rate'], weight_decay=config['weight_decay'])

print(f"\n🚀 Starting Phase 2 training ({config['finetune_epochs']} epochs)...")
print("💡 Fine-tuning will take longer than Phase 1 (full network backprop)")
print("💡 Memory usage will be higher - monitor for potential issues")

phase2_start = time.time()

# Continue from Phase 1 metrics
phase1_epochs = len(train_losses)
current_best_val_acc = best_val_acc
best_model_wts = copy.deepcopy(model.state_dict())

try:
    for epoch in range(config['finetune_epochs']):
        epoch_start = time.time()
        
        # Training
        train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer_ft, device)
        
        # Validation
        val_loss, val_acc = evaluate(model, val_loader, criterion, device)
        
        # Save best model
        if val_acc > current_best_val_acc:
            current_best_val_acc = val_acc
            best_model_wts = copy.deepcopy(model.state_dict())
        
        # Record metrics
        train_losses.append(train_loss)
        train_accuracies.append(train_acc)
        val_losses.append(val_loss)
        val_accuracies.append(val_acc)
        
        epoch_time = time.time() - epoch_start
        
        print(f"Epoch {epoch+1:2d}/{config['finetune_epochs']} | "
              f"Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.2f}% | "
              f"Val Loss: {val_loss:.4f} | Val Acc: {val_acc:.2f}% | "
              f"Time: {epoch_time:.1f}s")
        
        # Enhanced memory monitoring for fine-tuning
        if torch.cuda.is_available():
            memory_used = torch.cuda.memory_allocated() / 1e9
            memory_reserved = torch.cuda.memory_reserved() / 1e9
            print(f"           GPU Memory: {memory_used:.1f}GB used, {memory_reserved:.1f}GB reserved")
            
            # Warning if memory usage is high
            if memory_used > 8:  # 8GB threshold
                print("           ⚠️  High memory usage - consider reducing batch size")

except Exception as e:
    print(f"❌ Training error: {e}")
    print("💡 Common fixes for ResNet50:")
    print("   - Reduce batch_size to 16 or 8")
    print("   - Reduce num_workers to 1")
    print("   - Ensure sufficient GPU memory (>4GB recommended)")

phase2_time = time.time() - phase2_start
total_time = phase1_time + phase2_time

print(f"\n📊 Phase 2 Results:")
print(f"   ⏱️  Training time: {phase2_time:.1f}s ({phase2_time/60:.1f}m)")
print(f"   🎯 Best validation accuracy: {current_best_val_acc:.2f}%")
print(f"   📈 Final training accuracy: {train_accuracies[-1]:.2f}%")

print(f"\n🎉 Complete ResNet50 Training Summary:")
print(f"   ⏱️  Total time: {total_time:.1f}s ({total_time/60:.1f}m)")
print(f"   📊 Phase 1 → Phase 2 (best validation accuracy): {best_val_acc:.2f}% → {current_best_val_acc:.2f}%")
print(f"   🚀 Fine-tuning gain: {current_best_val_acc - best_val_acc:+.1f}%")

# Comprehensive comparison with ResNet18
resnet18_final_acc = 85.0  # Expected ResNet18 final accuracy
final_improvement = current_best_val_acc - resnet18_final_acc
print(f"\n🔍 Final ResNet50 vs ResNet18 Comparison:")
print(f"   📊 ResNet18 final: ~{resnet18_final_acc:.0f}%")
print(f"   📊 ResNet50 final: {current_best_val_acc:.2f}%")
print(f"   🚀 Depth advantage: {final_improvement:+.1f}%")
print(f"   ⚡ Training time ratio: {total_time/1200:.1f}× (ResNet18 ~20min baseline)")

# Load best model weights
model.load_state_dict(best_model_wts)
print("✅ Phase 2 complete! Best ResNet50 model weights loaded.")


## Step 8: Final Model Evaluation and Results Analysis

### Test Set Evaluation
Now we evaluate our trained ResNet50 model on the held-out test set:

**Why Test Set Evaluation Matters:**
- **Unbiased Performance**: Test set hasn't been seen during training
- **Generalization Check**: Validates model's ability to handle new data
- **Fair Comparison**: Consistent evaluation across all model architectures
- **Real-world Simulation**: Mimics deployment performance

### ResNet50 vs ResNet18 Final Comparison

**Expected Results:**
- **ResNet18 Test Accuracy**: ~83-85%
- **ResNet50 Test Accuracy**: ~86-88%
- **Improvement**: roughly +3 percentage points from the deeper architecture

**Performance Analysis:**
- **Computational Cost**: ResNet50 has about 2.2× as many parameters
- **Training Time**: ResNet50 takes ~1.5-2× longer to train
- **Memory Usage**: ResNet50 requires ~1.5-2× more GPU memory
- **Accuracy Gain**: ResNet50 is expected to score about 3 percentage points higher
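
A gain of a few percentage points looks different when expressed as a reduction in error rate. The sketch below uses illustrative values near the middle of the expected ranges above (84% and 87%); they are assumptions for the arithmetic, not measured results, and `error_reduction` is a hypothetical helper.

```python
def error_reduction(baseline_acc, new_acc):
    """Return (percentage-point gain, relative reduction in error rate)."""
    gain_pp = new_acc - baseline_acc
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return gain_pp, (baseline_err - new_err) / baseline_err


# Illustrative expected accuracies, not measurements.
gain_pp, relative = error_reduction(84.0, 87.0)
extra_params_m = 25.6 - 11.7  # additional parameters in millions

print(f"gain:                     {gain_pp:+.1f} percentage points")
print(f"relative error reduction: {relative:.1%}")
print(f"gain per extra M params:  {gain_pp / extra_params_m:.2f} points")
```

Stating both the absolute and the relative figure makes it easier to judge whether the extra compute is worth it for a given application.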

### Key Insights from ResNet50 Training
1. **Depth Benefits**: Deeper networks capture more complex features
2. **Diminishing Returns**: Performance gains plateau with extreme depth
3. **Memory Management**: Deeper networks require careful resource management
4. **Transfer Learning**: Pre-trained features work well across architectures


In [None]:
print("🧪 Final Model Evaluation on Test Set")
print("="*50)

# Evaluate on test set
print("🔍 Evaluating trained ResNet50 on test set...")
test_start = time.time()

try:
    test_loss, test_acc = evaluate(model, test_loader, criterion, device)
    test_time = time.time() - test_start
    
    print(f"\n📊 Test Set Results:")
    print(f"   🎯 Test Accuracy: {test_acc:.2f}%")
    print(f"   📉 Test Loss: {test_loss:.4f}")
    print(f"   ⏱️  Evaluation time: {test_time:.1f}s")
    
    # Performance analysis
    print(f"\n📈 ResNet50 Performance Summary:")
    print(f"   🎯 Final test accuracy: {test_acc:.2f}%")
    print(f"   📊 Best validation accuracy: {current_best_val_acc:.2f}%")
    print(f"   📋 Generalization gap: {current_best_val_acc - test_acc:.2f}%")
    
    # Compare with ResNet18
    resnet18_test_acc = 84.0  # Expected ResNet18 test accuracy
    improvement = test_acc - resnet18_test_acc
    print(f"\n🔍 Architecture Comparison:")
    print(f"   📊 ResNet18 (expected): ~{resnet18_test_acc:.0f}%")
    print(f"   📊 ResNet50 (actual): {test_acc:.2f}%")
    print(f"   🚀 Depth advantage: {improvement:+.1f}%")
    
    # Efficiency analysis
    print(f"\n⚡ Efficiency Analysis:")
    print(f"   📈 Parameter count: {total_params:,} ({total_params / resnet18_params:.1f}× ResNet18)")
    print(f"   ⏱️  Training time: {total_time/60:.1f}m ({total_time/1200:.1f}× the ~20 min ResNet18 baseline)")
    print(f"   🎯 Accuracy per parameter: {test_acc/(total_params/1e6):.2f}% per M params")
    print(f"   🚀 Accuracy per minute: {test_acc/(total_time/60):.2f}% per minute")
    
except Exception as e:
    print(f"❌ Test evaluation error: {e}")
    print("💡 Ensure model and data are properly loaded")

print(f"\n" + "="*70)
print("🎉 RESNET50 TRANSFER LEARNING COMPLETE!")
print("="*70)

print(f"\n📋 Final Summary:")
print(f"   🏗️  Architecture: ResNet50 (50 layers, {total_params/1e6:.1f}M parameters)")
print(f"   📊 Dataset: Flowers102 (102 classes, 8,189 images)")
print(f"   🎯 Final accuracy: {test_acc:.2f}%")
print(f"   ⏱️  Total training time: {total_time/60:.1f} minutes")
print(f"   🚀 Improvement over ResNet18: {improvement:+.1f}%")

print(f"\n🎓 Key Learnings:")
print(f"   • Deeper networks (ResNet50) provide better feature extraction")
print(f"   • Bottleneck blocks enable efficient deep network training")
print(f"   • Transfer learning works exceptionally well with deeper models")
print(f"   • Performance gains come at computational cost")
print(f"   • Memory management becomes critical with deeper networks")

print(f"\n📈 Next Steps:")
print(f"   • Lesson 5: EfficientNet-B0 (efficient architecture)")
print(f"   • Lesson 6: EfficientNet-B3 (scaled efficient architecture)")
print(f"   • Lesson 7: MobileNet-V2 (mobile-optimized architecture)")
print(f"   • Compare all architectures for optimal model selection")

print(f"\n⭐ ResNet50 Transfer Learning Success!")
