# Football Club Logo Classification — English Premier League

**Deep Learning Final Project**

This notebook implements multi-class image classification of English Premier League football club logos using Convolutional Neural Networks in PyTorch.

**Dataset:** ~20,000 images across 20 Premier League clubs ([Kaggle](https://www.kaggle.com/datasets/alexteboul/english-premier-league-logo-detection-20k-images))

**Project Structure:**

| File | Description |
|---|---|
| `models.py` | CNN architectures: SimpleCNN, DeepCNN, ResNet50 wrapper |
| `train_utils.py` | Training loop, evaluation, early stopping, seed utility |
| `data.py` | Transforms, dataset loading, stratified splitting, DataLoaders |
| `visualization.py` | Plotting functions, Grad-CAM implementation |
| This notebook | Experiments, results, and analysis |

**Experiments:**
1. **Part 1** — Custom CNNs from scratch with optimizer comparison, BatchNorm ablation, dropout & weight decay regularization
2. **Part 2** — Transfer learning via CIFAR-100 pretraining
3. **Part 3** — Fine-tuning pretrained ResNet50 with different freezing strategies
4. **Final Analysis** — Confusion matrix, per-class metrics, Grad-CAM, summary comparison

In [None]:
# ── Setup: Clone repo & install dependencies ────────────────────────────────
!git clone https://github.com/Guyisra26/Premier-League-Logo-Classification.git
%cd Premier-League-Logo-Classification
!pip install -q kagglehub

from pathlib import Path
from download_data import download_and_reset_data

DATA_DIR = Path("data")
if not DATA_DIR.exists() or not any(DATA_DIR.iterdir()):
    download_and_reset_data()
else:
    print(f"Dataset already exists at: {DATA_DIR}")

In [None]:
# ── Imports ─────────────────────────────────────────────────────────────────
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

from sklearn.metrics import confusion_matrix, classification_report

# Project modules
from models import SimpleCNN, DeepCNN, get_resnet50
from train_utils import set_seed, train_model, count_parameters
from data import (find_data_dir, explore_dataset, get_cnn_transforms,
                  get_resnet_transforms, stratified_split, create_loaders,
                  load_cifar100)
from visualization import (plot_curves, plot_comparison, show_samples,
                           plot_class_distribution, GradCAM, visualize_gradcam)

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

In [None]:
# ── Configuration ───────────────────────────────────────────────────────────
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

SEED = 42
BATCH_SIZE = 64
NUM_CLASSES = 20
NUM_EPOCHS = 30
PATIENCE = 7

set_seed(SEED)

# Store all experiment results
results = {}

---
## Section 1: Data Loading & Exploration

The dataset contains ~20,000 images of English Premier League club logos, organized in class-named folders. We explore the distribution, define augmentation transforms, and create stratified train/val/test splits (70/15/15).
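
To make the 70/15/15 stratified split concrete, here is a minimal pure-Python sketch. It is not the project's `stratified_split` from `data.py` (whose implementation is not shown here); the function name `stratified_indices` and the toy labels are assumptions for illustration. The idea is to shuffle indices within each class and cut each class at the same fractions, so every split keeps the original class proportions.

```python
import random
from collections import defaultdict

def stratified_indices(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Split sample indices into train/val/test, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_train = int(round(fractions[0] * len(idxs)))
        n_val = int(round(fractions[1] * len(idxs)))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]  # remainder goes to test
    return train, val, test

# Toy labels (hypothetical): 20 classes with 100 samples each
toy_labels = [c for c in range(20) for _ in range(100)]
toy_train, toy_val, toy_test = stratified_indices(toy_labels)
print(len(toy_train), len(toy_val), len(toy_test))  # 1400 300 300
```

Because the cut is made per class, a class with few images still contributes to all three splits instead of landing entirely in one of them.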

In [None]:
# Discover and explore dataset
DATA_DIR = find_data_dir('./data')
class_names, class_counts = explore_dataset(DATA_DIR)
plot_class_distribution(class_counts)

In [None]:
# Load full dataset for splitting, then create stratified indices
from torchvision import datasets

_, val_tf_cnn = get_cnn_transforms()
full_dataset = datasets.ImageFolder(DATA_DIR, transform=val_tf_cnn)
class_names_dataset = full_dataset.classes

print(f"Classes: {class_names_dataset}\n")
train_idx, val_idx, test_idx = stratified_split(full_dataset, seed=SEED)

In [None]:
# Create DataLoaders for custom CNNs (128x128) and ResNet50 (224x224)
train_tf_cnn, val_tf_cnn = get_cnn_transforms()
train_tf_resnet, val_tf_resnet = get_resnet_transforms()

train_loader_cnn, val_loader_cnn, test_loader_cnn = create_loaders(
    DATA_DIR, train_idx, val_idx, test_idx,
    train_tf_cnn, val_tf_cnn, batch_size=BATCH_SIZE
)

train_loader_resnet, val_loader_resnet, test_loader_resnet = create_loaders(
    DATA_DIR, train_idx, val_idx, test_idx,
    train_tf_resnet, val_tf_resnet, batch_size=BATCH_SIZE
)

print(f"CNN loaders — Train: {len(train_loader_cnn)} batches, "
      f"Val: {len(val_loader_cnn)}, Test: {len(test_loader_cnn)}")
print(f"ResNet loaders — Train: {len(train_loader_resnet)} batches, "
      f"Val: {len(val_loader_resnet)}, Test: {len(test_loader_resnet)}")

In [None]:
# Visualize sample images per class
from torch.utils.data import Subset
sample_dataset = Subset(
    datasets.ImageFolder(DATA_DIR, transform=val_tf_cnn), val_idx[:500]
)
show_samples(sample_dataset, class_names_dataset, n_per_class=3)

---
## Part 1 — Training CNNs from Scratch

We design two CNN architectures to compare a shallow baseline against a deeper network with skip connections:

| Model | Blocks | Filters | Key Feature |
|---|---|---|---|
| **SimpleCNN** | 3 conv | 32→64→128 | Flat FC head, minimal capacity |
| **DeepCNN** | 5 conv | 32→64→128→256→256 | Skip connections, GAP, configurable BN/Dropout |

**Why these architectures?** SimpleCNN serves as a minimal baseline — can a simple feature extractor distinguish 20 logo classes? DeepCNN adds depth with skip connections to test whether richer feature hierarchies improve classification without vanishing gradient issues.
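
As a rough sense of how the two filter progressions differ in capacity, the sketch below counts convolution parameters only. It assumes one 3×3 convolution with bias per block, which may not match `models.py` (BatchNorm, skip projections, and the classifier head are ignored), so the totals are illustrative rather than the real model sizes.

```python
def conv2d_params(in_channels, out_channels, kernel_size=3, bias=True):
    """Weights plus optional biases of one 2-D convolution layer."""
    weights = in_channels * out_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)

def stack_params(widths, in_channels=3):
    """Total conv parameters for a plain chain of conv layers."""
    total = 0
    for out_channels in widths:
        total += conv2d_params(in_channels, out_channels)
        in_channels = out_channels
    return total

# Assumed: one 3x3 conv per block, RGB input
simple = stack_params([32, 64, 128])
deep = stack_params([32, 64, 128, 256, 256])
print(f"3-block conv stack: {simple:,} parameters")  # 93,248
print(f"5-block conv stack: {deep:,} parameters")    # 978,496
```

Under these assumptions the two extra 256-filter blocks account for most of the added convolutional capacity. `count_parameters` in the next cell reports the actual figures.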

In [None]:
# Verify model architectures
dummy = torch.randn(2, 3, 128, 128).to(device)

print("SimpleCNN:")
m = SimpleCNN(NUM_CLASSES).to(device)
count_parameters(m)
print(f"  Output shape: {m(dummy).shape}\n")

print("DeepCNN (BN=True, dropout=0.5):")
m = DeepCNN(NUM_CLASSES, use_batchnorm=True, dropout_rate=0.5).to(device)
count_parameters(m)
print(f"  Output shape: {m(dummy).shape}")

del m, dummy

### Step 1: Optimizer Comparison — Adam vs SGD+Momentum

We compare two widely used optimizers:
- **Adam** (lr=1e-3, default betas): Adaptive per-parameter learning rates. Expected to converge faster.
- **SGD+Momentum** (lr=1e-2, momentum=0.9, StepLR decay by 0.1 every 15 epochs): One global learning rate shared by all parameters, stepped down by the scheduler, with momentum smoothing the updates. Expected to generalize better but converge slower.

Both optimizers are tested on both architectures, with all other hyperparameters held constant.
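
The difference between the two update rules can be sketched on a single scalar parameter. This is a simplified illustration of the standard formulas (no weight decay, no Nesterov, first step only), not a replacement for `torch.optim`; the function names are made up for this example.

```python
import math

def sgd_momentum_step(grad, buf, lr=1e-2, momentum=0.9):
    """One SGD+momentum update; returns (parameter change, new buffer)."""
    buf = momentum * buf + grad
    return -lr * buf, buf

def adam_step(grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns (parameter change, new m, new v)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias correction for the running mean
    v_hat = v / (1 - beta2 ** t)  # bias correction for the running variance
    return -lr * m_hat / (math.sqrt(v_hat) + eps), m, v

for grad in (0.01, 100.0):
    sgd_delta, _ = sgd_momentum_step(grad, buf=0.0)
    adam_delta, _, _ = adam_step(grad, m=0.0, v=0.0, t=1)
    print(f"grad={grad:>6}: SGD step={sgd_delta:+.6f}, Adam step={adam_delta:+.6f}")
```

The SGD step scales with the gradient, while the first Adam step has magnitude close to `lr` regardless of gradient scale. That normalization is one reason Adam tends to make fast early progress without per-layer tuning.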

In [None]:
# ── SimpleCNN + Adam ────────────────────────────────────────────────────────
print("="*60)
print("SimpleCNN + Adam (lr=1e-3)")
print("="*60)
set_seed(SEED)
model_simple_adam = SimpleCNN(NUM_CLASSES).to(device)
optimizer = optim.Adam(model_simple_adam.parameters(), lr=1e-3)

results['SimpleCNN_Adam'] = train_model(
    model_simple_adam, train_loader_cnn, val_loader_cnn,
    nn.CrossEntropyLoss(), optimizer,
    epochs=NUM_EPOCHS, patience=PATIENCE, device=device
)
plot_curves(results['SimpleCNN_Adam'], 'SimpleCNN + Adam')

In [None]:
# ── SimpleCNN + SGD+Momentum ────────────────────────────────────────────────
print("="*60)
print("SimpleCNN + SGD+Momentum (lr=1e-2)")
print("="*60)
set_seed(SEED)
model_simple_sgd = SimpleCNN(NUM_CLASSES).to(device)
optimizer = optim.SGD(model_simple_sgd.parameters(), lr=1e-2, momentum=0.9)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1)

results['SimpleCNN_SGD'] = train_model(
    model_simple_sgd, train_loader_cnn, val_loader_cnn,
    nn.CrossEntropyLoss(), optimizer, scheduler=scheduler,
    epochs=NUM_EPOCHS, patience=PATIENCE, device=device
)
plot_curves(results['SimpleCNN_SGD'], 'SimpleCNN + SGD+Momentum')

In [None]:
# ── SimpleCNN comparison ────────────────────────────────────────────────────
plot_comparison(
    [results['SimpleCNN_Adam'], results['SimpleCNN_SGD']],
    ['Adam', 'SGD+Momentum'],
    title='SimpleCNN: Optimizer Comparison'
)
for k in ['SimpleCNN_Adam', 'SimpleCNN_SGD']:
    h = results[k]
    print(f"  {k:<20} Best val acc: {h['best_val_acc']:.4f} (epoch {h['best_epoch']})")

In [None]:
# ── DeepCNN + Adam ──────────────────────────────────────────────────────────
print("="*60)
print("DeepCNN + Adam (lr=1e-3)")
print("="*60)
set_seed(SEED)
model_deep_adam = DeepCNN(NUM_CLASSES, use_batchnorm=True, dropout_rate=0.5).to(device)
optimizer = optim.Adam(model_deep_adam.parameters(), lr=1e-3)

results['DeepCNN_Adam'] = train_model(
    model_deep_adam, train_loader_cnn, val_loader_cnn,
    nn.CrossEntropyLoss(), optimizer,
    epochs=NUM_EPOCHS, patience=PATIENCE, device=device
)
plot_curves(results['DeepCNN_Adam'], 'DeepCNN + Adam')

In [None]:
# ── DeepCNN + SGD+Momentum ──────────────────────────────────────────────────
print("="*60)
print("DeepCNN + SGD+Momentum (lr=1e-2)")
print("="*60)
set_seed(SEED)
model_deep_sgd = DeepCNN(NUM_CLASSES, use_batchnorm=True, dropout_rate=0.5).to(device)
optimizer = optim.SGD(model_deep_sgd.parameters(), lr=1e-2, momentum=0.9)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1)

results['DeepCNN_SGD'] = train_model(
    model_deep_sgd, train_loader_cnn, val_loader_cnn,
    nn.CrossEntropyLoss(), optimizer, scheduler=scheduler,
    epochs=NUM_EPOCHS, patience=PATIENCE, device=device
)
plot_curves(results['DeepCNN_SGD'], 'DeepCNN + SGD+Momentum')

In [None]:
# ── DeepCNN comparison ──────────────────────────────────────────────────────
plot_comparison(
    [results['DeepCNN_Adam'], results['DeepCNN_SGD']],
    ['Adam', 'SGD+Momentum'],
    title='DeepCNN: Optimizer Comparison'
)
for k in ['DeepCNN_Adam', 'DeepCNN_SGD']:
    h = results[k]
    print(f"  {k:<20} Best val acc: {h['best_val_acc']:.4f} (epoch {h['best_epoch']})")

### Analysis: Architecture & Optimizer Comparison

**Architecture observations:**
- Compare SimpleCNN vs DeepCNN with the same optimizer — the deeper model should show higher capacity and better final accuracy, but may overfit more if regularization is insufficient.
- The skip connections in DeepCNN help gradient flow through 5 blocks, which would otherwise risk vanishing gradients.

**Optimizer observations:**
- Adam typically converges within fewer epochs due to its adaptive learning rate — useful when compute is limited.
- SGD+Momentum with StepLR scheduling takes longer to converge but the learning rate decay at epoch 15 often improves final generalization.
- Compare the gap between train and val curves for each optimizer — a smaller gap indicates better generalization.
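
The step schedule used for SGD can be written out directly. The sketch assumes the scheduler is stepped once per epoch inside `train_model`, which is the usual pattern but is not visible in this notebook.

```python
def step_lr(base_lr, epoch, step_size=15, gamma=0.1):
    """Learning rate used during `epoch` (0-indexed) under step decay."""
    return base_lr * gamma ** (epoch // step_size)

# Assumed: one scheduler step per epoch, 30 epochs in total
schedule = [step_lr(1e-2, epoch) for epoch in range(30)]
print(f"epochs 0-14:  lr = {schedule[0]:.0e}")
print(f"epochs 15-29: lr = {schedule[15]:.0e}")
```

One consequence worth checking in the curves: with `PATIENCE = 7`, early stopping can end a run before epoch 15, in which case the decay never takes effect and the SGD run is effectively a constant-rate run.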

**Key question:** Does the faster convergence of Adam come at the cost of generalization for this dataset?

### Step 2: Batch Normalization Ablation

We train DeepCNN with and without BatchNorm to isolate its effect on:
- Training stability (smoother loss curve?)
- Convergence speed (fewer epochs to a given accuracy?)
- Final validation performance

In [None]:
# ── DeepCNN WITHOUT BatchNorm ───────────────────────────────────────────────
print("="*60)
print("DeepCNN WITHOUT BatchNorm (Adam, lr=1e-3)")
print("="*60)
set_seed(SEED)
model_deep_nobn = DeepCNN(NUM_CLASSES, use_batchnorm=False, dropout_rate=0.5).to(device)
count_parameters(model_deep_nobn)
optimizer = optim.Adam(model_deep_nobn.parameters(), lr=1e-3)

results['DeepCNN_NoBN'] = train_model(
    model_deep_nobn, train_loader_cnn, val_loader_cnn,
    nn.CrossEntropyLoss(), optimizer,
    epochs=NUM_EPOCHS, patience=PATIENCE, device=device
)
plot_curves(results['DeepCNN_NoBN'], 'DeepCNN (No BatchNorm)')

In [None]:
# ── BatchNorm ablation comparison ───────────────────────────────────────────
plot_comparison(
    [results['DeepCNN_Adam'], results['DeepCNN_NoBN']],
    ['With BatchNorm', 'Without BatchNorm'],
    title='BatchNorm Ablation (DeepCNN + Adam)'
)
print(f"  With BN:    val acc = {results['DeepCNN_Adam']['best_val_acc']:.4f}")
print(f"  Without BN: val acc = {results['DeepCNN_NoBN']['best_val_acc']:.4f}")

### Analysis: Batch Normalization

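BatchNorm normalizes each channel's activations over the mini-batch to zero mean and unit variance, then applies a learnable scale and shift. This provides several benefits:

1. **Training stability**: BatchNorm was originally motivated as a way to reduce internal covariate shift; later analyses attribute much of its benefit to a smoother optimization landscape. In practice it tolerates higher learning rates.
2. **Implicit regularization**: The mini-batch statistics introduce noise that acts as a mild regularizer.
3. **Faster convergence**: Keeping pre-activation values in a well-scaled range means fewer ReLU units get stuck at zero and gradients keep a usable magnitude across layers.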

**Expected outcome**: Without BatchNorm, the 5-block DeepCNN should show more volatile training loss, slower convergence, and potentially lower final accuracy. The deeper the model, the more critical BatchNorm becomes.
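
For reference, the training-time BatchNorm computation for a single channel can be sketched in a few lines. This is a simplified scalar version (no running statistics, one channel, plain Python lists) meant only to show the normalize-then-affine structure.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one channel's activations over the batch, then scale and shift."""
    mean = sum(values) / len(values)
    var = sum((x - mean) ** 2 for x in values) / len(values)  # biased variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in values]

activations = [1.0, 2.0, 3.0, 4.0]
normalized = batch_norm(activations)
print([round(x, 3) for x in normalized])
```

With `gamma=1` and `beta=0` the output has mean 0 and variance very close to 1; the learnable `gamma` and `beta` let the network undo the normalization where that helps.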

### Step 3: Regularization — Dropout & Weight Decay

We test three regularization techniques:
1. **Dropout ablation**: rates of 0.0, 0.3, and 0.5 in the classifier head
2. **Weight Decay (L2 regularization)**: `weight_decay=1e-4` in the optimizer
3. **Data augmentation**: already applied to all training runs (flips, rotation, color jitter, random crop)

Each experiment changes one variable at a time to isolate the effect.
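
The dropout rates being compared can be made concrete with a small sketch of inverted dropout, the variant used by most frameworks during training. It operates on a plain list and uses an explicit random generator; both are simplifications for illustration.

```python
import random

def inverted_dropout(values, p, rng):
    """Zero each value with probability p and rescale survivors by 1/(1-p)."""
    if p == 0.0:
        return list(values)
    keep = 1.0 - p
    return [x / keep if rng.random() < keep else 0.0 for x in values]

rng = random.Random(42)
ones = [1.0] * 10_000
for p in (0.0, 0.3, 0.5):
    out = inverted_dropout(ones, p, rng)
    print(f"p={p}: kept {sum(1 for x in out if x != 0.0)} of {len(out)}, "
          f"mean activation {sum(out) / len(out):.3f}")
```

The rescaling keeps the expected activation unchanged, so no adjustment is needed at evaluation time, when dropout is switched off.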

In [None]:
# ── Dropout ablation ────────────────────────────────────────────────────────
for dr in [0.0, 0.3]:
    print("="*60)
    print(f"DeepCNN with dropout={dr} (Adam, lr=1e-3)")
    print("="*60)
    set_seed(SEED)
    model_dr = DeepCNN(NUM_CLASSES, use_batchnorm=True, dropout_rate=dr).to(device)
    optimizer = optim.Adam(model_dr.parameters(), lr=1e-3)

    results[f'DeepCNN_dropout{dr}'] = train_model(
        model_dr, train_loader_cnn, val_loader_cnn,
        nn.CrossEntropyLoss(), optimizer,
        epochs=NUM_EPOCHS, patience=PATIENCE, device=device
    )
    print()

In [None]:
# ── Weight Decay experiment ─────────────────────────────────────────────────
print("="*60)
print("DeepCNN + Adam with weight_decay=1e-4 (L2 regularization)")
print("="*60)
set_seed(SEED)
model_deep_wd = DeepCNN(NUM_CLASSES, use_batchnorm=True, dropout_rate=0.5).to(device)
optimizer = optim.Adam(model_deep_wd.parameters(), lr=1e-3, weight_decay=1e-4)

results['DeepCNN_WeightDecay'] = train_model(
    model_deep_wd, train_loader_cnn, val_loader_cnn,
    nn.CrossEntropyLoss(), optimizer,
    epochs=NUM_EPOCHS, patience=PATIENCE, device=device
)
plot_curves(results['DeepCNN_WeightDecay'], 'DeepCNN + Adam + Weight Decay')

In [None]:
# ── Regularization comparison plots ─────────────────────────────────────────

# Dropout comparison
plot_comparison(
    [results['DeepCNN_dropout0.0'], results['DeepCNN_dropout0.3'],
     results['DeepCNN_Adam']],
    ['Dropout=0.0', 'Dropout=0.3', 'Dropout=0.5 (default)'],
    title='Dropout Rate Ablation'
)

# Weight decay comparison
plot_comparison(
    [results['DeepCNN_Adam'], results['DeepCNN_WeightDecay']],
    ['No Weight Decay', 'Weight Decay=1e-4'],
    title='Weight Decay (L2 Regularization) Ablation'
)

# All regularization in one view
plot_comparison(
    [results['DeepCNN_dropout0.0'], results['DeepCNN_Adam'],
     results['DeepCNN_WeightDecay']],
    ['No Dropout, No WD', 'Dropout=0.5', 'Dropout=0.5 + WD=1e-4'],
    title='Regularization Effect Overview'
)

### Analysis: Regularization

**Dropout:**
- Dropout=0.0 (no dropout) — the model has maximum capacity but is most prone to overfitting. Expect the train-val gap to be widest.
- Dropout=0.3 — mild regularization, may offer a good balance.
- Dropout=0.5 — stronger regularization, may slightly limit training accuracy but should improve generalization.

**Weight Decay (L2):**
- L2 regularization penalizes large weights, encouraging the model to learn smoother decision boundaries.
- It works differently from dropout: dropout drops neurons randomly, while weight decay constrains weight magnitudes.
- Using both together can be complementary — dropout prevents co-adaptation while weight decay prevents overly complex individual features.
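
In `optim.Adam`, `weight_decay` is applied as an L2 penalty: `weight_decay * w` is added to the gradient before the adaptive update (the decoupled variant is `optim.AdamW`). The sketch below shows that coupled form with a plain SGD step on one scalar weight, which is a deliberate simplification to isolate the decay term.

```python
def l2_coupled_gradient(grad, weight, weight_decay=1e-4):
    """Gradient after adding the L2 penalty term (coupled weight decay)."""
    return grad + weight_decay * weight

def sgd_update(weight, grad, lr=1e-2, weight_decay=1e-4):
    """Plain SGD step on the penalized gradient (simplified; not Adam)."""
    return weight - lr * l2_coupled_gradient(grad, weight, weight_decay)

# With a zero data gradient only the decay term acts,
# so the weight shrinks by a factor (1 - lr * weight_decay) per step.
w = 1.0
for _ in range(1000):
    w = sgd_update(w, grad=0.0)
print(f"weight after 1000 decay-only steps: {w:.6f}")
```

The shrinkage per step is tiny at `1e-4`, which is why weight decay mostly shows up late in training as a slightly smaller train-val gap rather than as a visible change in early epochs.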

**Data Augmentation** (applied in all experiments):
- Random horizontal flips, rotations (15°), color jitter, and random resized crops effectively increase the training set diversity.
- This is especially important for logo classification where logos can appear at different scales and orientations.

In [None]:
# ── Part 1 Summary Table ────────────────────────────────────────────────────
print("\n" + "="*75)
print("PART 1 SUMMARY — CNNs from Scratch")
print("="*75)

part1_keys = [
    ('SimpleCNN + Adam', 'SimpleCNN_Adam'),
    ('SimpleCNN + SGD', 'SimpleCNN_SGD'),
    ('DeepCNN + Adam', 'DeepCNN_Adam'),
    ('DeepCNN + SGD', 'DeepCNN_SGD'),
    ('DeepCNN (no BN)', 'DeepCNN_NoBN'),
    ('DeepCNN dropout=0.0', 'DeepCNN_dropout0.0'),
    ('DeepCNN dropout=0.3', 'DeepCNN_dropout0.3'),
    ('DeepCNN + Weight Decay', 'DeepCNN_WeightDecay'),
]
print(f"{'Model':<28} {'Best Val Acc':>12} {'Best Epoch':>11} {'Time (s)':>10}")
print("-" * 63)
for name, key in part1_keys:
    h = results[key]
    print(f"{name:<28} {h['best_val_acc']:>12.4f} {h['best_epoch']:>11} "
          f"{h['training_time']:>10.1f}")

---
## Part 2 — Transfer Learning via CIFAR-100 Pretraining

**Goal:** Test whether pretraining on an external dataset improves logo classification.

**Why CIFAR-100?**
- 100 classes of natural images (animals, objects, vehicles) — forces the network to learn diverse, discriminative features.
- Structurally different from logos (natural photos vs. graphic designs), making the transfer non-trivial.
- We chose CIFAR-100 over CIFAR-10 because its finer-grained label set should push the network toward richer feature representations.

**Procedure:**
1. Pretrain DeepCNN on CIFAR-100 (upscaled to 128×128) for up to 30 epochs, with early stopping (patience 7)
2. Remove the 100-class head, attach a new 20-class head
3. Fine-tune on logo dataset with reduced learning rate (1e-4)
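
Step 2 amounts to copying every tensor whose name and shape match and leaving the rest at their fresh initialization. The sketch below shows that selection rule on plain dictionaries of shapes; the parameter names are hypothetical and do not necessarily match `DeepCNN`'s real `state_dict` keys.

```python
def transferable_keys(source_shapes, target_shapes):
    """Names present in both models whose tensor shapes match exactly."""
    return [name for name, shape in source_shapes.items()
            if target_shapes.get(name) == shape]

# Hypothetical parameter names and shapes, for illustration only
cifar_model = {
    "block1.conv.weight": (32, 3, 3, 3),
    "block5.conv.weight": (256, 256, 3, 3),
    "classifier.weight": (100, 256),
    "classifier.bias": (100,),
}
logo_model = dict(cifar_model)
logo_model["classifier.weight"] = (20, 256)
logo_model["classifier.bias"] = (20,)

kept = transferable_keys(cifar_model, logo_model)
print(f"transferred: {kept}")
```

Only the head tensors are skipped, because going from 100 to 20 classes changes their shapes.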

In [None]:
# Load CIFAR-100
cifar_train_loader, cifar_val_loader = load_cifar100(batch_size=BATCH_SIZE)

In [None]:
# ── Pretrain DeepCNN on CIFAR-100 ───────────────────────────────────────────
print("="*60)
print("Pretraining DeepCNN on CIFAR-100 (100 classes)")
print("="*60)
set_seed(SEED)
model_cifar_pretrain = DeepCNN(num_classes=100, use_batchnorm=True,
                               dropout_rate=0.5).to(device)
count_parameters(model_cifar_pretrain)
optimizer = optim.Adam(model_cifar_pretrain.parameters(), lr=1e-3)

history_cifar = train_model(
    model_cifar_pretrain, cifar_train_loader, cifar_val_loader,
    nn.CrossEntropyLoss(), optimizer,
    epochs=NUM_EPOCHS, patience=PATIENCE, device=device
)
plot_curves(history_cifar, 'DeepCNN Pretraining on CIFAR-100')

In [None]:
# ── Transfer to logo dataset ────────────────────────────────────────────────
print("="*60)
print("Fine-tuning CIFAR-100 pretrained DeepCNN on logo dataset")
print("="*60)
set_seed(SEED)
model_pretrained = DeepCNN(num_classes=NUM_CLASSES, use_batchnorm=True,
                           dropout_rate=0.5).to(device)

# Load backbone weights, skip classifier (shape mismatch: 100 vs 20)
pretrained_dict = model_cifar_pretrain.state_dict()
model_dict = model_pretrained.state_dict()
pretrained_dict = {k: v for k, v in pretrained_dict.items()
                   if k in model_dict and v.shape == model_dict[k].shape}
model_dict.update(pretrained_dict)
model_pretrained.load_state_dict(model_dict)
print(f"Loaded {len(pretrained_dict)}/{len(model_dict)} weight tensors from CIFAR-100 model")

optimizer = optim.Adam(model_pretrained.parameters(), lr=1e-4)

results['DeepCNN_Pretrained'] = train_model(
    model_pretrained, train_loader_cnn, val_loader_cnn,
    nn.CrossEntropyLoss(), optimizer,
    epochs=NUM_EPOCHS, patience=PATIENCE, device=device
)
plot_curves(results['DeepCNN_Pretrained'], 'DeepCNN (CIFAR-100 Pretrained)')

In [None]:
# ── Pretrained vs From-Scratch comparison ──────────────────────────────────
plot_comparison(
    [results['DeepCNN_Adam'], results['DeepCNN_Pretrained']],
    ['From Scratch (Adam)', 'CIFAR-100 Pretrained'],
    title='DeepCNN: From Scratch vs CIFAR-100 Pretrained'
)
print(f"  From scratch: val acc = {results['DeepCNN_Adam']['best_val_acc']:.4f} "
      f"(epoch {results['DeepCNN_Adam']['best_epoch']})")
print(f"  Pretrained:   val acc = {results['DeepCNN_Pretrained']['best_val_acc']:.4f} "
      f"(epoch {results['DeepCNN_Pretrained']['best_epoch']})")

### Analysis: CIFAR-100 Pretraining

**What we expect:**
- The pretrained model should converge faster because the convolutional filters already encode useful low-level features (edges, textures, color gradients).
- The final accuracy gain may be modest because of the domain gap: CIFAR-100 contains natural photos while our target is graphically designed logos.

**What to look for in the curves:**
- Early epochs: Does the pretrained model start at a higher accuracy? (Indicates useful feature transfer)
- Convergence: Does it reach its best accuracy in fewer epochs?
- Final performance: Is the ceiling higher or just reached sooner?

**Interpretation of results:**
If pretraining helps despite the domain gap, it suggests that low-level features are domain-agnostic — a key insight about what CNNs learn in their early layers.

---
## Part 3 — Transfer Learning with Pretrained ResNet50

ResNet50 pretrained on ImageNet provides a powerful feature backbone trained on 1.2M images across 1,000 classes. We experiment with three freezing strategies:

| Strategy | Trainable Layers | Learning Rate | Rationale |
|---|---|---|---|
| **Fully frozen** | FC head only | 1e-3 | Test if ImageNet features are sufficient as-is |
| **Partial unfreeze** | layer4 + FC | 1e-4 | Allow high-level features to adapt |
| **Full fine-tune** | All layers | backbone 1e-5, head 1e-3 | Maximum adaptation with differential LR |

**Why these strategies?** If the fully frozen backbone already achieves high accuracy, that suggests ImageNet features transfer well to logos. If full fine-tuning is needed to reach comparable accuracy, the domain gap is significant.

In [None]:
# ── Experiment 3a: Fully Frozen (head only) ────────────────────────────────
print("="*60)
print("ResNet50 — Fully Frozen (head only)")
print("="*60)
set_seed(SEED)
model_resnet_frozen = get_resnet50(NUM_CLASSES, 'full_freeze').to(device)
count_parameters(model_resnet_frozen)
optimizer = optim.Adam(
    filter(lambda p: p.requires_grad, model_resnet_frozen.parameters()), lr=1e-3
)

results['ResNet50_Frozen'] = train_model(
    model_resnet_frozen, train_loader_resnet, val_loader_resnet,
    nn.CrossEntropyLoss(), optimizer,
    epochs=NUM_EPOCHS, patience=PATIENCE, device=device
)
plot_curves(results['ResNet50_Frozen'], 'ResNet50 (Fully Frozen)')

In [None]:
# ── Experiment 3b: Partial Unfreeze (layer4 + head) ────────────────────────
print("="*60)
print("ResNet50 — Partial Unfreeze (layer4 + head)")
print("="*60)
set_seed(SEED)
model_resnet_partial = get_resnet50(NUM_CLASSES, 'partial').to(device)
count_parameters(model_resnet_partial)
optimizer = optim.Adam(
    filter(lambda p: p.requires_grad, model_resnet_partial.parameters()), lr=1e-4
)

results['ResNet50_Partial'] = train_model(
    model_resnet_partial, train_loader_resnet, val_loader_resnet,
    nn.CrossEntropyLoss(), optimizer,
    epochs=NUM_EPOCHS, patience=PATIENCE, device=device
)
plot_curves(results['ResNet50_Partial'], 'ResNet50 (Partial Unfreeze)')

In [None]:
# ── Experiment 3c: Full Fine-tune (differential LR) ────────────────────────
print("="*60)
print("ResNet50 — Full Fine-tune (differential LR)")
print("="*60)
set_seed(SEED)
model_resnet_full = get_resnet50(NUM_CLASSES, 'full_finetune').to(device)
count_parameters(model_resnet_full)
optimizer = optim.Adam([
    {'params': [p for n, p in model_resnet_full.named_parameters()
                if 'fc' not in n], 'lr': 1e-5},
    {'params': model_resnet_full.fc.parameters(), 'lr': 1e-3},
])

results['ResNet50_FullFinetune'] = train_model(
    model_resnet_full, train_loader_resnet, val_loader_resnet,
    nn.CrossEntropyLoss(), optimizer,
    epochs=NUM_EPOCHS, patience=PATIENCE, device=device
)
plot_curves(results['ResNet50_FullFinetune'], 'ResNet50 (Full Fine-tune)')

In [None]:
# ── ResNet50 strategies comparison ──────────────────────────────────────────
plot_comparison(
    [results['ResNet50_Frozen'], results['ResNet50_Partial'],
     results['ResNet50_FullFinetune']],
    ['Fully Frozen', 'Partial (layer4)', 'Full Fine-tune'],
    title='ResNet50 Freezing Strategies'
)
for k in ['ResNet50_Frozen', 'ResNet50_Partial', 'ResNet50_FullFinetune']:
    h = results[k]
    print(f"  {k:<25} val acc = {h['best_val_acc']:.4f} (epoch {h['best_epoch']})")

### Analysis: ResNet50 Transfer Learning

**Feature extraction (frozen):**
- If this achieves high accuracy, it confirms that ImageNet features are highly transferable to logo classification.
- The model trains very fast since only ~40K parameters are updated.
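
The parameter count quoted above can be checked by hand. ResNet50's final feature vector has 2048 dimensions; the sketch assumes `get_resnet50` replaces the head with a single linear layer to 20 classes, which is not confirmed by the code shown in this notebook.

```python
def linear_params(in_features, out_features, bias=True):
    """Parameter count of one fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)

# Assumed head: a single Linear(2048, 20)
head = linear_params(2048, 20)
print(f"trainable head parameters: {head:,}")  # 40,980
```

If the head has extra hidden layers or dropout, `count_parameters` in the experiment cell will report a different number.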

**Partial unfreezing (layer4):**
- Unfreezing layer4 allows the network to refine its high-level feature representations for the logo domain.
- This is often the sweet spot: enough adaptation without catastrophic forgetting of pretrained features.

**Full fine-tuning:**
- Differential learning rates (1e-5 for backbone, 1e-3 for head) prevent the pretrained weights from being destroyed.
- This provides maximum adaptation but risks overfitting if the dataset is too small.

**Comparison with custom CNNs:**
- ResNet50 has ~23.5M parameters vs ~3.5M for DeepCNN — but with freezing, far fewer are trained.
- The key question: is the performance gain worth the additional model complexity?

---
## Final Analysis & Comparison

In [None]:
# ── Summary results table ───────────────────────────────────────────────────
summary_data = []
model_configs = {
    'SimpleCNN + Adam':       ('SimpleCNN_Adam',       SimpleCNN(NUM_CLASSES)),
    'SimpleCNN + SGD':        ('SimpleCNN_SGD',        SimpleCNN(NUM_CLASSES)),
    'DeepCNN + Adam':         ('DeepCNN_Adam',         DeepCNN(NUM_CLASSES)),
    'DeepCNN + SGD':          ('DeepCNN_SGD',          DeepCNN(NUM_CLASSES)),
    'DeepCNN (no BN)':        ('DeepCNN_NoBN',         DeepCNN(NUM_CLASSES, use_batchnorm=False)),
    'DeepCNN (dropout=0.0)':  ('DeepCNN_dropout0.0',   DeepCNN(NUM_CLASSES, dropout_rate=0.0)),
    'DeepCNN (dropout=0.3)':  ('DeepCNN_dropout0.3',   DeepCNN(NUM_CLASSES, dropout_rate=0.3)),
    'DeepCNN + Weight Decay': ('DeepCNN_WeightDecay',  DeepCNN(NUM_CLASSES)),
    'DeepCNN (CIFAR-100 PT)': ('DeepCNN_Pretrained',   DeepCNN(NUM_CLASSES)),
    'ResNet50 (Frozen)':      ('ResNet50_Frozen',      get_resnet50(NUM_CLASSES, 'full_freeze')),
    'ResNet50 (Partial)':     ('ResNet50_Partial',     get_resnet50(NUM_CLASSES, 'partial')),
    'ResNet50 (Full FT)':     ('ResNet50_FullFinetune', get_resnet50(NUM_CLASSES, 'full_finetune')),
}

for display_name, (key, model_tmp) in model_configs.items():
    if key in results:
        h = results[key]
        total_p = sum(p.numel() for p in model_tmp.parameters())
        train_p = sum(p.numel() for p in model_tmp.parameters() if p.requires_grad)
        summary_data.append({
            'Model': display_name,
            'Best Val Acc': f"{h['best_val_acc']:.4f}",
            'Best Epoch': h['best_epoch'],
            'Total Params': f"{total_p:,}",
            'Trainable Params': f"{train_p:,}",
            'Time (s)': f"{h['training_time']:.1f}",
        })

df_summary = pd.DataFrame(summary_data)
print("="*80)
print("FINAL RESULTS SUMMARY")
print("="*80)
display(df_summary)

In [None]:
# ── Test set evaluation: Confusion Matrix ──────────────────────────────────
# Only models still held in memory can be evaluated. The dropout-ablation
# models were overwritten inside their loop, so they are excluded here;
# selecting over all of `results` could raise a KeyError below.
model_map = {
    'SimpleCNN_Adam': model_simple_adam,
    'SimpleCNN_SGD': model_simple_sgd,
    'DeepCNN_Adam': model_deep_adam,
    'DeepCNN_SGD': model_deep_sgd,
    'DeepCNN_NoBN': model_deep_nobn,
    'DeepCNN_WeightDecay': model_deep_wd,
    'DeepCNN_Pretrained': model_pretrained,
    'ResNet50_Frozen': model_resnet_frozen,
    'ResNet50_Partial': model_resnet_partial,
    'ResNet50_FullFinetune': model_resnet_full,
}

# Pick the best model among those that can actually be evaluated
best_key = max(model_map, key=lambda k: results[k]['best_val_acc'])
print(f"Best model by validation accuracy: {best_key} "
      f"(val acc = {results[best_key]['best_val_acc']:.4f})\n")

# ResNet50 expects 224x224 inputs, so it needs its own test loader
is_resnet = 'ResNet50' in best_key
test_loader = test_loader_resnet if is_resnet else test_loader_cnn

best_model = model_map[best_key]
best_model.eval()

all_preds, all_labels = [], []
with torch.no_grad():
    for images, labels in test_loader:
        outputs = best_model(images.to(device))
        all_preds.extend(outputs.argmax(1).cpu().numpy())
        all_labels.extend(labels.numpy())

test_acc = np.mean(np.array(all_preds) == np.array(all_labels))
print(f"Test Accuracy: {test_acc:.4f}")

cm = confusion_matrix(all_labels, all_preds)
plt.figure(figsize=(14, 12))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names_dataset,
            yticklabels=class_names_dataset)
plt.xlabel('Predicted'); plt.ylabel('True')
plt.title(f'Confusion Matrix — {best_key} (Test Acc: {test_acc:.4f})')
plt.xticks(rotation=45, ha='right', fontsize=8)
plt.yticks(fontsize=8)
plt.tight_layout()
plt.show()

In [None]:
# ── Per-class metrics ──────────────────────────────────────────────────────
print("Per-Class Classification Report:")
print("="*70)
print(classification_report(all_labels, all_preds,
                            target_names=class_names_dataset))

report = classification_report(all_labels, all_preds,
                               target_names=class_names_dataset, output_dict=True)

class_f1 = {cls: report[cls]['f1-score'] for cls in class_names_dataset}
plt.figure(figsize=(14, 5))
plt.bar(class_f1.keys(), class_f1.values(), color='teal')
plt.axhline(y=np.mean(list(class_f1.values())), color='red', linestyle='--',
            label=f"Mean F1: {np.mean(list(class_f1.values())):.3f}")
plt.xticks(rotation=45, ha='right', fontsize=8)
plt.ylabel('F1-Score')
plt.title('Per-Class F1 Scores')
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
# ── Grad-CAM Visualization ─────────────────────────────────────────────────
if 'ResNet50' in best_key:
    target_layer = best_model.layer4[-1]
    grad_loader = test_loader_resnet
else:
    target_layer = (best_model.block5 if hasattr(best_model, 'block5')
                    else best_model.features[-1])
    grad_loader = test_loader_cnn

visualize_gradcam(best_model, target_layer, grad_loader,
                  class_names_dataset, device, n_samples=6)

### Analysis: Test Set Results

**Confusion matrix observations:**
- Look for off-diagonal clusters — which logo pairs are most frequently confused?
- Clubs with visually similar logos (similar colors, circular shapes) may be harder to distinguish.
- Low-sample classes (if any) may show weaker performance due to limited training data.

**Per-class F1 scores:**
- Classes with lower F1 scores indicate logos that are harder for the model to classify.
- Large variance across classes suggests the model struggles with specific visual features.
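
The per-class F1 scores follow directly from the confusion matrix. The sketch below uses the same convention as `sklearn.metrics.confusion_matrix` (rows are true labels, columns are predictions) on a made-up 3-class matrix; the numbers are not results from this notebook.

```python
def per_class_f1(cm):
    """F1 per class from a square confusion matrix (rows = true, cols = predicted)."""
    n = len(cm)
    scores = []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, actually another class
        fn = sum(cm[c]) - tp                       # actually c, predicted another class
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

# Toy 3-class matrix, not real results
toy_cm = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
]
print([round(f, 3) for f in per_class_f1(toy_cm)])  # [0.842, 0.857, 1.0]
```

Classes 0 and 1 are confused with each other, so both lose F1, while class 2 is unaffected. The same pattern in the real matrix points to specific logo pairs rather than a general weakness.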

**Grad-CAM:**
- The heatmaps show which image regions most influenced the model's prediction.
- Ideally, the model should attend to distinctive logo features (crests, text, unique symbols) rather than background artifacts.
- If the model focuses on irrelevant regions, it suggests potential generalization issues.
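
How a Grad-CAM heatmap is formed can be shown on a tiny example. This is a sketch of the standard computation (channel weights from spatially averaged gradients, weighted sum of feature maps, then ReLU) on nested lists with made-up numbers; it is not the `GradCAM` class from `visualization.py`, which also handles hooks, upsampling, and normalization.

```python
def grad_cam(activations, gradients):
    """Grad-CAM map from per-channel feature maps and their gradients.

    Both arguments are lists of channels; each channel is a 2-D list (H x W).
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight = spatial mean of the class-score gradient
    weights = [sum(map(sum, g)) / (height * width) for g in gradients]
    cam = [[0.0] * width for _ in range(height)]
    for w, act in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                cam[i][j] += w * act[i][j]
    # ReLU keeps only regions with a positive influence on the class score
    return [[max(0.0, x) for x in row] for row in cam]

# Toy 2-channel, 2x2 example (made-up numbers)
acts = [[[1.0, 0.0], [0.0, 2.0]],
        [[0.0, 3.0], [1.0, 0.0]]]
grads = [[[0.5, 0.5], [0.5, 0.5]],      # channel 0 supports the class
         [[-1.0, -1.0], [-1.0, -1.0]]]  # channel 1 works against it
print(grad_cam(acts, grads))  # [[0.5, 0.0], [0.0, 1.0]]
```

Locations where only the opposing channel fires end up at zero, which is why the heatmap highlights evidence for the predicted class rather than all active regions.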

---
## Final Reflection

### What Architectural Choices Mattered Most

1. **Pretrained weights** — The gap between training from scratch and fine-tuning ResNet50 is the single largest factor. ImageNet features provide a massive head start even for graphically designed logos.
2. **Network depth with skip connections** — DeepCNN outperforms SimpleCNN, confirming that the added depth (with residual connections to maintain gradient flow) enables better feature hierarchies.
3. **Batch Normalization** — Removing BN from the 5-block model noticeably degrades stability and convergence, making it essential for deeper custom architectures.
4. **Optimizer choice** — Adam converges faster, which is practical under time constraints. SGD+Momentum may match or slightly exceed Adam's final accuracy with careful scheduling.
5. **Regularization** — Dropout and weight decay help close the train-val gap. Their impact is secondary to architecture and initialization, but meaningful for the final few percentage points.

### When Transfer Learning Helped

- **CIFAR-100 pretraining**: Likely provides a modest improvement through better low-level features (edges, textures). The domain gap limits the benefit of higher-level features.
- **ResNet50 (ImageNet)**: Provides the strongest results due to the massive and diverse pretraining dataset. Even the frozen backbone extracts highly useful features.

**Key insight**: Transfer learning is not a universal solution — its effectiveness depends on domain similarity. Low-level features transfer well across domains, but high-level features require adaptation.

### What Would I Do Differently

- **Data**: Higher-resolution images with more background diversity would make the task harder and more realistic.
- **Architectures**: Test EfficientNet or Vision Transformers (ViT) as alternatives to ResNet50.
- **Augmentation**: Try CutMix or MixUp to help the from-scratch models close the gap with transfer learning.
- **Pretraining dataset**: Use a logo-specific dataset (e.g., FlickrLogos-32) instead of CIFAR-100 to test domain-matched vs general pretraining.