# Football Club Logo Classification — English Premier League

**Deep Learning Final Project**

Classifying 20 Premier League club logos using CNNs in PyTorch.

**Dataset:** ~20,000 images across 20 clubs from [Kaggle](https://www.kaggle.com/datasets/alexteboul/english-premier-league-logo-detection-20k-images)

**Structure:**
- **Part 1** — Train custom CNNs from scratch, compare architectures/optimizers, tune hyperparameters, test regularization
- **Part 2** — Transfer learning: pretrain on CIFAR-100, fine-tune on logos
- **Part 3** — Fine-tune pretrained ResNet50 with different freezing strategies

In [None]:
# Setup
import os
if not os.path.exists('Premier-League-Logo-Classification'):
    !git clone https://github.com/Guyisra26/Premier-League-Logo-Classification.git
%cd Premier-League-Logo-Classification
!pip install -q kagglehub

from pathlib import Path
from download_data import download_and_reset_data

DATA_DIR = Path("data")
if not DATA_DIR.exists() or not any(DATA_DIR.iterdir()):
    download_and_reset_data()
else:
    print(f"Dataset already exists at: {DATA_DIR}")

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings
warnings.filterwarnings('ignore')

from sklearn.metrics import confusion_matrix, classification_report
from torchvision import datasets
from torch.utils.data import Subset

from models import SimpleCNN, DeepCNN, get_resnet50
from train_utils import set_seed, train_model, count_parameters
from data import (find_data_dir, explore_dataset, get_cnn_transforms,
                  get_cnn_transforms_no_aug, get_resnet_transforms,
                  stratified_split, create_loaders, load_cifar100)
from visualization import (plot_curves, plot_comparison, show_samples,
                           plot_class_distribution, visualize_gradcam)

print(f"PyTorch: {torch.__version__}")
print(f"CUDA: {torch.cuda.is_available()}")

In [None]:
# Configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")

SEED = 42
BATCH_SIZE = 64
NUM_CLASSES = 20
NUM_EPOCHS = 12
PATIENCE = 4

set_seed(SEED)
results = {}

WEIGHTS_DIR = './saved_weights'
os.makedirs(WEIGHTS_DIR, exist_ok=True)
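
`PATIENCE = 4` controls early stopping inside `train_model`, whose code lives in `train_utils.py` and is not shown here. As a rough sketch of what a patience rule usually does (the function name `simulate_early_stopping` and the toy accuracy curve are my own illustration, not the project's implementation):

```python
def simulate_early_stopping(val_accs, patience):
    """Replay a validation-accuracy curve and report where a patience rule would stop.

    Hypothetical helper for illustration; epochs are 1-indexed.
    """
    best_acc, best_epoch, wait = float("-inf"), 0, 0
    stopped_epoch = len(val_accs)
    for epoch, acc in enumerate(val_accs, start=1):
        if acc > best_acc:
            # New best: remember it and reset the patience counter
            best_acc, best_epoch, wait = acc, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                stopped_epoch = epoch
                break
    return {"best_val_acc": best_acc, "best_epoch": best_epoch,
            "stopped_epoch": stopped_epoch}


# Toy curve: improves for 3 epochs, then stalls for 4, so training stops at epoch 7
toy_curve = [0.50, 0.60, 0.65, 0.64, 0.63, 0.62, 0.61, 0.70]
summary = simulate_early_stopping(toy_curve, patience=4)
print(summary)  # {'best_val_acc': 0.65, 'best_epoch': 3, 'stopped_epoch': 7}
```

Note that the late jump to 0.70 is never seen, which is the trade-off of a small patience value combined with a short epoch budget.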

---
## Data Loading & Exploration

In [None]:
DATA_DIR = find_data_dir('./data')
class_names, class_counts = explore_dataset(DATA_DIR)
plot_class_distribution(class_counts)

In [None]:
# Stratified split: 70% train, 15% val, 15% test
_, val_tf_cnn = get_cnn_transforms()
full_dataset = datasets.ImageFolder(DATA_DIR, transform=val_tf_cnn)
class_names_dataset = full_dataset.classes
print(f"Classes: {class_names_dataset}\n")

train_idx, val_idx, test_idx = stratified_split(full_dataset, seed=SEED)
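
The real `stratified_split` is defined in `data.py` and not shown in this notebook. A minimal sketch of the idea, assuming a 70/15/15 split done separately inside each class (the name `stratified_split_sketch` and the toy labels are hypothetical; with an `ImageFolder` the labels would probably come from its `targets` attribute):

```python
import random
from collections import defaultdict


def stratified_split_sketch(labels, train_frac=0.70, val_frac=0.15, seed=42):
    """Split indices so every class keeps roughly the same train/val/test ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx, test_idx = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # shuffle within the class only
        n_train = round(len(indices) * train_frac)
        n_val = round(len(indices) * val_frac)
        train_idx += indices[:n_train]
        val_idx += indices[n_train:n_train + n_val]
        test_idx += indices[n_train + n_val:]  # remainder goes to test
    return train_idx, val_idx, test_idx


# Toy labels: 20 classes with 100 samples each
toy_labels = [c for c in range(20) for _ in range(100)]
tr, va, te = stratified_split_sketch(toy_labels)
print(len(tr), len(va), len(te))  # 1400 300 300
```

Splitting per class like this keeps the class balance identical across the three subsets, which matters when comparing validation and test accuracy.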

In [None]:
# DataLoaders for custom CNNs (128x128) and ResNet (224x224)
train_tf_cnn, val_tf_cnn = get_cnn_transforms()
train_tf_resnet, val_tf_resnet = get_resnet_transforms()

train_loader_cnn, val_loader_cnn, test_loader_cnn = create_loaders(
    DATA_DIR, train_idx, val_idx, test_idx,
    train_tf_cnn, val_tf_cnn, batch_size=BATCH_SIZE)

train_loader_resnet, val_loader_resnet, test_loader_resnet = create_loaders(
    DATA_DIR, train_idx, val_idx, test_idx,
    train_tf_resnet, val_tf_resnet, batch_size=BATCH_SIZE)

print(f"CNN: {len(train_loader_cnn)} train batches, {len(val_loader_cnn)} val, {len(test_loader_cnn)} test")
print(f"ResNet: {len(train_loader_resnet)} train batches")

In [None]:
# Sample images
sample_dataset = Subset(datasets.ImageFolder(DATA_DIR, transform=val_tf_cnn), val_idx[:500])
show_samples(sample_dataset, class_names_dataset, n_per_class=3)

---
## Part 1 — Training CNNs from Scratch

I chose two architectures to compare:

- **SimpleCNN** — 3 conv blocks (32→64→128), flat FC head. A minimal baseline.
- **DeepCNN** — 5 conv blocks (32→64→128→256→256) with skip connections and Global Average Pooling. More capacity, but also more risk of overfitting.

The idea is to see if the extra depth actually helps, or if a simpler model is enough for 20 logo classes.
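
The actual `SimpleCNN` and `DeepCNN` classes live in `models.py`. To make "skip connections and Global Average Pooling" concrete, here is a small sketch of that kind of design. The class names `ResidualBlockSketch` and `TinyResidualNet`, and details such as two convolutions per block and the 1x1 shortcut projection, are my assumptions and not the repository's code:

```python
import torch
import torch.nn as nn


class ResidualBlockSketch(nn.Module):
    """Hypothetical conv block with a skip connection, followed by 2x2 max pooling."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # 1x1 projection so the shortcut matches the channel count when it changes
        if in_ch == out_ch:
            self.shortcut = nn.Identity()
        else:
            self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        out = torch.relu(self.body(x) + self.shortcut(x))  # skip connection
        return self.pool(out)


class TinyResidualNet(nn.Module):
    """Five residual blocks (32->64->128->256->256), global average pooling, linear head."""

    def __init__(self, num_classes=20, widths=(32, 64, 128, 256, 256)):
        super().__init__()
        blocks, in_ch = [], 3
        for width in widths:
            blocks.append(ResidualBlockSketch(in_ch, width))
            in_ch = width
        self.blocks = nn.Sequential(*blocks)
        self.gap = nn.AdaptiveAvgPool2d(1)  # one value per channel, any input size
        self.fc = nn.Linear(in_ch, num_classes)

    def forward(self, x):
        x = self.gap(self.blocks(x)).flatten(1)
        return self.fc(x)


net = TinyResidualNet()
logits = net(torch.randn(2, 3, 128, 128))
print(logits.shape)  # torch.Size([2, 20])
```

With Global Average Pooling the head only sees one number per channel, so it needs far fewer parameters than a flat FC head that consumes the whole feature map.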

In [None]:
# Check model architectures
dummy = torch.randn(2, 3, 128, 128).to(device)

print("SimpleCNN:")
m = SimpleCNN(NUM_CLASSES).to(device)
count_parameters(m)
print(f"  Output: {m(dummy).shape}\n")

print("DeepCNN:")
m = DeepCNN(NUM_CLASSES).to(device)
count_parameters(m)
print(f"  Output: {m(dummy).shape}")
del m, dummy

### Step 1: Optimizer Comparison — Adam vs SGD+Momentum

I'm comparing Adam (lr=1e-3) and SGD with momentum (lr=1e-2, StepLR scheduler) on both architectures. Everything else stays the same so the comparison is fair.
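
For reference, this is the learning rate the SGD runs should see under `StepLR(step_size=7, gamma=0.1)` over 12 epochs. It is a plain-Python sketch of the schedule formula and assumes `train_model` calls `scheduler.step()` once per epoch, which I have not verified here (`step_lr` is a made-up helper name):

```python
def step_lr(base_lr, epoch, step_size=7, gamma=0.1):
    """Learning rate in effect during `epoch` (0-indexed) under a StepLR-style schedule."""
    return base_lr * gamma ** (epoch // step_size)


schedule = [step_lr(1e-2, epoch) for epoch in range(12)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch:2d}: lr = {lr:.0e}")
```

So the first 7 epochs run at 1e-2 and the last 5 at 1e-3. With early stopping, a run may end before or shortly after the drop.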

In [None]:
# SimpleCNN + Adam
set_seed(SEED)
model_simple_adam = SimpleCNN(NUM_CLASSES).to(device)
optimizer = optim.Adam(model_simple_adam.parameters(), lr=1e-3)
results['SimpleCNN_Adam'] = train_model(
    model_simple_adam, train_loader_cnn, val_loader_cnn,
    nn.CrossEntropyLoss(), optimizer, epochs=NUM_EPOCHS, patience=PATIENCE, device=device)
plot_curves(results['SimpleCNN_Adam'], 'SimpleCNN + Adam')

In [None]:
# SimpleCNN + SGD
set_seed(SEED)
model_simple_sgd = SimpleCNN(NUM_CLASSES).to(device)
optimizer = optim.SGD(model_simple_sgd.parameters(), lr=1e-2, momentum=0.9)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)
results['SimpleCNN_SGD'] = train_model(
    model_simple_sgd, train_loader_cnn, val_loader_cnn,
    nn.CrossEntropyLoss(), optimizer, scheduler=scheduler,
    epochs=NUM_EPOCHS, patience=PATIENCE, device=device)
plot_curves(results['SimpleCNN_SGD'], 'SimpleCNN + SGD')

In [None]:
# DeepCNN + Adam
set_seed(SEED)
model_deep_adam = DeepCNN(NUM_CLASSES).to(device)
optimizer = optim.Adam(model_deep_adam.parameters(), lr=1e-3)
results['DeepCNN_Adam'] = train_model(
    model_deep_adam, train_loader_cnn, val_loader_cnn,
    nn.CrossEntropyLoss(), optimizer, epochs=NUM_EPOCHS, patience=PATIENCE, device=device)
plot_curves(results['DeepCNN_Adam'], 'DeepCNN + Adam')

In [None]:
# DeepCNN + SGD
set_seed(SEED)
model_deep_sgd = DeepCNN(NUM_CLASSES).to(device)
optimizer = optim.SGD(model_deep_sgd.parameters(), lr=1e-2, momentum=0.9)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)
results['DeepCNN_SGD'] = train_model(
    model_deep_sgd, train_loader_cnn, val_loader_cnn,
    nn.CrossEntropyLoss(), optimizer, scheduler=scheduler,
    epochs=NUM_EPOCHS, patience=PATIENCE, device=device)
plot_curves(results['DeepCNN_SGD'], 'DeepCNN + SGD')

In [None]:
# Compare all 4 combinations
plot_comparison(
    [results['SimpleCNN_Adam'], results['SimpleCNN_SGD'],
     results['DeepCNN_Adam'], results['DeepCNN_SGD']],
    ['SimpleCNN+Adam', 'SimpleCNN+SGD', 'DeepCNN+Adam', 'DeepCNN+SGD'],
    title='Architecture & Optimizer Comparison')

for k in ['SimpleCNN_Adam', 'SimpleCNN_SGD', 'DeepCNN_Adam', 'DeepCNN_SGD']:
    h = results[k]
    print(f"  {k:<20} val acc: {h['best_val_acc']:.4f}  (epoch {h['best_epoch']})")

### Step 1b: Learning Rate Sweep

Testing lr = {1e-2, 1e-3, 1e-4} with DeepCNN + Adam to find the best learning rate. I reuse the lr=1e-3 result from above.

In [None]:
# LR sweep
lr_results = {'lr=1e-3': results['DeepCNN_Adam']}

for lr in [1e-2, 1e-4]:
    set_seed(SEED)
    model = DeepCNN(NUM_CLASSES).to(device)
    optimizer = optim.Adam(model.parameters(), lr=lr)
    key = f'lr={lr}'
    results[f'DeepCNN_lr{lr}'] = train_model(
        model, train_loader_cnn, val_loader_cnn,
        nn.CrossEntropyLoss(), optimizer,
        epochs=NUM_EPOCHS, patience=PATIENCE, device=device)
    lr_results[key] = results[f'DeepCNN_lr{lr}']
    del model

plot_comparison(list(lr_results.values()), list(lr_results.keys()),
                title='Learning Rate Sweep (DeepCNN + Adam)')
for name, h in lr_results.items():
    print(f"  {name}: val acc = {h['best_val_acc']:.4f} (epoch {h['best_epoch']})")

### Step 1c: Batch Size Sweep

Testing batch_size = {32, 64, 128}. Reusing bs=64 from above.

In [None]:
# Batch size sweep
bs_results = {'bs=64': results['DeepCNN_Adam']}

for bs in [32, 128]:
    set_seed(SEED)
    tl, vl, _ = create_loaders(DATA_DIR, train_idx, val_idx, test_idx,
                                train_tf_cnn, val_tf_cnn, batch_size=bs)
    model = DeepCNN(NUM_CLASSES).to(device)
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
    key = f'bs={bs}'
    results[f'DeepCNN_bs{bs}'] = train_model(
        model, tl, vl, nn.CrossEntropyLoss(), optimizer,
        epochs=NUM_EPOCHS, patience=PATIENCE, device=device)
    bs_results[key] = results[f'DeepCNN_bs{bs}']
    del model, tl, vl

plot_comparison(list(bs_results.values()), list(bs_results.keys()),
                title='Batch Size Sweep (DeepCNN + Adam)')
for name, h in bs_results.items():
    print(f"  {name}: val acc = {h['best_val_acc']:.4f} (epoch {h['best_epoch']})")

### Analysis: Architecture, Optimizers & Hyperparameters

**Architecture:** DeepCNN outperformed SimpleCNN with both optimizers. The extra depth and skip connections helped the model learn more complex features for distinguishing the 20 logos. The skip connections likely matter here because they give gradients a direct path through the 5 blocks, which makes the deeper stack easier to optimize.

**Optimizers:** Adam converged faster than SGD in both architectures. This makes sense because Adam adapts the learning rate per parameter. SGD with the StepLR schedule was slower but still reached reasonable accuracy.

**Learning rate:** The sweep showed that the default lr=1e-3 for Adam works well. lr=1e-2 was too aggressive (loss was unstable), and lr=1e-4 converged too slowly.

**Batch size:** The differences between batch sizes were smaller than expected. Smaller batches (32) add more gradient noise, which can help regularize, while larger batches (128) train faster per epoch but may generalize slightly worse.

**Overfitting:** Looking at the train vs val curves, the gap between training and validation accuracy grows over time — this is the classic sign of overfitting. The deeper model (DeepCNN) shows a bigger gap, which makes sense because it has more parameters.
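
To illustrate the point about Adam adapting the step size per parameter, here is a tiny scalar sketch of one update step for two parameters whose gradients differ by four orders of magnitude. It is a simplified re-implementation of the standard Adam update rule for illustration only (the helper names are mine), compared with plain SGD without momentum:

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (t is the 1-indexed step count)."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


def sgd_step(param, grad, lr=1e-2):
    """Plain SGD update (no momentum) for comparison."""
    return param - lr * grad


grads = {"large_grad": 10.0, "small_grad": 0.001}
adam_moves, sgd_moves = {}, {}
for name, g in grads.items():
    new_param, _, _ = adam_step(0.0, g, m=0.0, v=0.0, t=1)
    adam_moves[name] = abs(new_param)
    sgd_moves[name] = abs(sgd_step(0.0, g))

print("Adam step sizes:", adam_moves)  # both close to lr = 1e-3
print("SGD step sizes: ", sgd_moves)   # proportional to the gradient
```

On the first step Adam moves both parameters by about `lr`, while the SGD steps differ by the same factor as the gradients. That normalization is one reason Adam tends to make fast early progress without much tuning.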

### Step 2: Batch Normalization Ablation

Training DeepCNN with and without BatchNorm to see how much it matters.

In [None]:
# DeepCNN without BatchNorm
set_seed(SEED)
model_deep_nobn = DeepCNN(NUM_CLASSES, use_batchnorm=False).to(device)
count_parameters(model_deep_nobn)
optimizer = optim.Adam(model_deep_nobn.parameters(), lr=1e-3)
results['DeepCNN_NoBN'] = train_model(
    model_deep_nobn, train_loader_cnn, val_loader_cnn,
    nn.CrossEntropyLoss(), optimizer, epochs=NUM_EPOCHS, patience=PATIENCE, device=device)

# Compare
plot_comparison(
    [results['DeepCNN_Adam'], results['DeepCNN_NoBN']],
    ['With BatchNorm', 'Without BatchNorm'],
    title='BatchNorm Ablation')
print(f"  With BN:    {results['DeepCNN_Adam']['best_val_acc']:.4f}")
print(f"  Without BN: {results['DeepCNN_NoBN']['best_val_acc']:.4f}")

### Analysis: Batch Normalization

Removing BatchNorm from DeepCNN caused a noticeable drop in performance. The training was less stable (you can see the loss curve is more jittery) and it converged slower.

This makes sense — BatchNorm normalizes the activations between layers, which helps with:
- **Stability:** prevents activations from blowing up or vanishing
- **Speed:** allows using higher learning rates
- **Regularization:** the mini-batch statistics add a small amount of noise

For a 5-block network like DeepCNN, BatchNorm is clearly important. A simpler 3-block model might get away without it, but the deeper you go, the more you need it.
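
A quick way to see the normalization effect is to push a badly scaled batch through `nn.BatchNorm2d` in training mode and compare per-channel statistics before and after. This is a standalone sketch with random data, not taken from the project's models:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Fake activations with a large offset and scale: batch of 16, 8 channels, 4x4 maps
x = torch.randn(16, 8, 4, 4) * 5.0 + 3.0

bn = nn.BatchNorm2d(8)
bn.train()  # training mode: normalize with the statistics of the current mini-batch
y = bn(x)

in_mean = x.mean(dim=(0, 2, 3))
in_var = x.var(dim=(0, 2, 3), unbiased=False)
out_mean = y.mean(dim=(0, 2, 3))
out_var = y.var(dim=(0, 2, 3), unbiased=False)

print("input  mean/var per channel:", in_mean[:3], in_var[:3])
print("output mean/var per channel:", out_mean[:3], out_var[:3])
```

Each channel comes out with mean close to 0 and variance close to 1 (before the learnable scale and shift, which start at 1 and 0). Because the statistics depend on which samples land in the batch, they also inject the small amount of noise mentioned above.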

### Step 3: Regularization — Dropout, Weight Decay & Data Augmentation

Testing three regularization techniques, one at a time:
1. **Dropout** — rates 0.0, 0.3, 0.5
2. **Weight decay** (L2) — 1e-4
3. **Data augmentation** — with vs without

In [None]:
# Dropout ablation
# Note: the 0.5 entry reuses DeepCNN_Adam, which assumes DeepCNN's default dropout_rate is 0.5
for dr in [0.0, 0.3]:
    set_seed(SEED)
    model = DeepCNN(NUM_CLASSES, dropout_rate=dr).to(device)
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
    results[f'DeepCNN_dr{dr}'] = train_model(
        model, train_loader_cnn, val_loader_cnn,
        nn.CrossEntropyLoss(), optimizer,
        epochs=NUM_EPOCHS, patience=PATIENCE, device=device)
    del model

plot_comparison(
    [results['DeepCNN_dr0.0'], results['DeepCNN_dr0.3'], results['DeepCNN_Adam']],
    ['Dropout=0.0', 'Dropout=0.3', 'Dropout=0.5'],
    title='Dropout Ablation')
for dr, key in [('0.0','DeepCNN_dr0.0'), ('0.3','DeepCNN_dr0.3'), ('0.5','DeepCNN_Adam')]:
    print(f"  Dropout={dr}: {results[key]['best_val_acc']:.4f}")

In [None]:
# Weight decay
set_seed(SEED)
model_deep_wd = DeepCNN(NUM_CLASSES).to(device)
optimizer = optim.Adam(model_deep_wd.parameters(), lr=1e-3, weight_decay=1e-4)
results['DeepCNN_WD'] = train_model(
    model_deep_wd, train_loader_cnn, val_loader_cnn,
    nn.CrossEntropyLoss(), optimizer, epochs=NUM_EPOCHS, patience=PATIENCE, device=device)

plot_comparison(
    [results['DeepCNN_Adam'], results['DeepCNN_WD']],
    ['No Weight Decay', 'Weight Decay=1e-4'],
    title='Weight Decay Ablation')
print(f"  No WD:      {results['DeepCNN_Adam']['best_val_acc']:.4f}")
print(f"  WD=1e-4:    {results['DeepCNN_WD']['best_val_acc']:.4f}")

In [None]:
# Data augmentation ablation
set_seed(SEED)
train_tf_noaug, val_tf_noaug = get_cnn_transforms_no_aug()
tl_noaug, vl_noaug, _ = create_loaders(
    DATA_DIR, train_idx, val_idx, test_idx,
    train_tf_noaug, val_tf_noaug, batch_size=BATCH_SIZE)

model_noaug = DeepCNN(NUM_CLASSES).to(device)
optimizer = optim.Adam(model_noaug.parameters(), lr=1e-3)
results['DeepCNN_NoAug'] = train_model(
    model_noaug, tl_noaug, vl_noaug,
    nn.CrossEntropyLoss(), optimizer,
    epochs=NUM_EPOCHS, patience=PATIENCE, device=device)

plot_comparison(
    [results['DeepCNN_Adam'], results['DeepCNN_NoAug']],
    ['With Augmentation', 'Without Augmentation'],
    title='Data Augmentation Ablation')
print(f"  With aug:    {results['DeepCNN_Adam']['best_val_acc']:.4f}")
print(f"  Without aug: {results['DeepCNN_NoAug']['best_val_acc']:.4f}")
del model_noaug, tl_noaug, vl_noaug

### Analysis: Regularization

**Dropout:** Without any dropout (0.0), the model overfits more — the training accuracy goes very high but the validation accuracy doesn't follow. Dropout=0.5 gave the best generalization by randomly dropping neurons during training, which prevents the network from relying too heavily on specific features.

**Weight decay:** Adding L2 regularization (weight_decay=1e-4) penalizes large weights and encourages simpler solutions. The effect was smaller than dropout — using both together might be too much regularization for this dataset size.

**Data augmentation:** Training without augmentation led to faster overfitting. Without random flips, rotations, and color jitter, the model memorizes the training images instead of learning generalizable features. The train-val gap was much larger without augmentation.

Overall, data augmentation had the biggest regularization impact, followed by dropout, and then weight decay.
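
As a small sanity check on how dropout behaves, this sketch runs `nn.Dropout(p=0.5)` on a vector of ones in training and evaluation mode. It is independent of the `DeepCNN` code and only shows the layer's behavior:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

drop = nn.Dropout(p=0.5)
x = torch.ones(10_000)

drop.train()
train_out = drop(x)  # about half the entries are zeroed, survivors are scaled by 1/(1-p) = 2

drop.eval()
eval_out = drop(x)   # identity at evaluation time

kept_fraction = (train_out != 0).float().mean().item()
print(f"kept fraction in train mode: {kept_fraction:.3f}")
print(f"values seen in train mode:   {sorted(train_out.unique().tolist())}")
print(f"eval output unchanged:       {torch.equal(eval_out, x)}")
```

The rescaling during training keeps the expected activation the same in both modes, which is why no extra correction is needed at test time as long as `model.eval()` is called.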

In [None]:
# Part 1 summary
print("PART 1 SUMMARY")
print("="*65)
print(f"{'Model':<28} {'Val Acc':>10} {'Epoch':>7} {'Time':>8}")
print("-"*55)

for name, key in [
    ('SimpleCNN + Adam', 'SimpleCNN_Adam'),
    ('SimpleCNN + SGD', 'SimpleCNN_SGD'),
    ('DeepCNN + Adam', 'DeepCNN_Adam'),
    ('DeepCNN + SGD', 'DeepCNN_SGD'),
    ('DeepCNN (no BN)', 'DeepCNN_NoBN'),
    ('DeepCNN dropout=0.0', 'DeepCNN_dr0.0'),
    ('DeepCNN dropout=0.3', 'DeepCNN_dr0.3'),
    ('DeepCNN + weight decay', 'DeepCNN_WD'),
    ('DeepCNN (no augmentation)', 'DeepCNN_NoAug'),
    ('DeepCNN lr=1e-2', 'DeepCNN_lr0.01'),
    ('DeepCNN lr=1e-4', 'DeepCNN_lr0.0001'),
    ('DeepCNN bs=32', 'DeepCNN_bs32'),
    ('DeepCNN bs=128', 'DeepCNN_bs128'),
]:
    if key in results:
        h = results[key]
        print(f"{name:<28} {h['best_val_acc']:>10.4f} {h['best_epoch']:>7} {h['training_time']:>7.1f}s")

In [None]:
# Save best Part 1 models
torch.save(model_deep_adam.state_dict(), os.path.join(WEIGHTS_DIR, 'deep_cnn_best.pth'))
torch.save(model_simple_adam.state_dict(), os.path.join(WEIGHTS_DIR, 'simple_cnn_best.pth'))
print(f"Saved model weights to {WEIGHTS_DIR}/")

---
## Part 2 — Transfer Learning via CIFAR-100 Pretraining

**Idea:** Pretrain the DeepCNN on CIFAR-100 (100 classes of natural images), save the weights, then fine-tune on our logo dataset with a smaller learning rate.

**Why CIFAR-100?**
- 100 classes forces the model to learn diverse features
- It's structurally different from logos (photos vs graphics) — so this tests whether low-level features transfer across domains
- CIFAR-100 over CIFAR-10 because more classes = richer feature representations

In [None]:
cifar_train_loader, cifar_val_loader = load_cifar100(batch_size=BATCH_SIZE)

In [None]:
# Pretrain on CIFAR-100
set_seed(SEED)
model_cifar = DeepCNN(num_classes=100).to(device)
optimizer = optim.Adam(model_cifar.parameters(), lr=1e-3)
history_cifar = train_model(
    model_cifar, cifar_train_loader, cifar_val_loader,
    nn.CrossEntropyLoss(), optimizer,
    epochs=NUM_EPOCHS, patience=PATIENCE, device=device)
plot_curves(history_cifar, 'DeepCNN on CIFAR-100')

# Save pretrained weights
cifar_path = os.path.join(WEIGHTS_DIR, 'cifar100_pretrained.pth')
torch.save(model_cifar.state_dict(), cifar_path)
print(f"Saved pretrained weights to {cifar_path}")

In [None]:
# Load pretrained weights and fine-tune on logos
set_seed(SEED)
model_pretrained = DeepCNN(num_classes=NUM_CLASSES).to(device)

# Load backbone weights (skip classifier since it has different output size)
pretrained_dict = torch.load(cifar_path, map_location=device, weights_only=True)
model_dict = model_pretrained.state_dict()
pretrained_dict = {k: v for k, v in pretrained_dict.items()
                   if k in model_dict and v.shape == model_dict[k].shape}
model_dict.update(pretrained_dict)
model_pretrained.load_state_dict(model_dict)
print(f"Loaded {len(pretrained_dict)}/{len(model_dict)} state-dict tensors from CIFAR-100 model")

# Fine-tune with smaller learning rate
optimizer = optim.Adam(model_pretrained.parameters(), lr=1e-4)
results['Pretrained'] = train_model(
    model_pretrained, train_loader_cnn, val_loader_cnn,
    nn.CrossEntropyLoss(), optimizer,
    epochs=NUM_EPOCHS, patience=PATIENCE, device=device)
plot_curves(results['Pretrained'], 'CIFAR-100 Pretrained → Logos')

torch.save(model_pretrained.state_dict(), os.path.join(WEIGHTS_DIR, 'pretrained_finetuned.pth'))

In [None]:
# Compare pretrained vs from scratch
plot_comparison(
    [results['DeepCNN_Adam'], results['Pretrained']],
    ['From Scratch', 'CIFAR-100 Pretrained'],
    title='From Scratch vs Pretrained')
print(f"  From scratch: {results['DeepCNN_Adam']['best_val_acc']:.4f} (epoch {results['DeepCNN_Adam']['best_epoch']})")
print(f"  Pretrained:   {results['Pretrained']['best_val_acc']:.4f} (epoch {results['Pretrained']['best_epoch']})")

### Analysis: CIFAR-100 Transfer Learning

The pretrained model converged faster in the early epochs — you can see it starts at a higher accuracy than the from-scratch model. This is because the convolutional layers already know how to detect edges, textures, and basic shapes from CIFAR-100.

However, the final accuracy improvement was modest. This is likely because of the domain gap: CIFAR-100 contains natural photos (animals, vehicles, etc.), while our dataset is graphically designed logos. The low-level features (edges, colors) transfer well, but the higher-level features (object parts, natural-scene patterns) don't match.

The key takeaway is that even pretraining on a mismatched domain helps somewhat — the model doesn't have to learn basic visual features from scratch.

---
## Part 3 — Transfer Learning with Pretrained ResNet50

ResNet50 pretrained on ImageNet (1.2M images, 1000 classes) should give a big boost. I'm testing three freezing strategies:

1. **Fully frozen** — only train the new FC head (fast, tests if ImageNet features are enough)
2. **Partial unfreeze** — unfreeze layer4 + head (let high-level features adapt)
3. **Full fine-tune** — train everything with differential LR (backbone=1e-5, head=1e-3)
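
The three strategies come down to which parameters have `requires_grad=True`. `get_resnet50` is defined in `models.py` and not shown, so the following is only a sketch of the idea: `set_trainable` and `TinyBackbone` are hypothetical names, and the toy model just mimics the `layer4` / `fc` attribute names that torchvision's ResNet uses so the example runs quickly without downloading weights.

```python
import torch.nn as nn


def set_trainable(model, strategy):
    """Toggle requires_grad for a ResNet-style model that has `layer4` and `fc`."""
    if strategy == "full_finetune":
        prefixes = None                      # everything trainable
    elif strategy == "partial":
        prefixes = ("layer4.", "fc.")        # last stage + head
    elif strategy == "full_freeze":
        prefixes = ("fc.",)                  # head only
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    for name, param in model.named_parameters():
        param.requires_grad = prefixes is None or name.startswith(prefixes)
    return model


def count_trainable(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


class TinyBackbone(nn.Module):
    """Toy stand-in with ResNet-like attribute names (illustration only)."""

    def __init__(self, num_classes=20):
        super().__init__()
        self.layer3 = nn.Conv2d(3, 8, 3, padding=1)    # 3*8*9 + 8   = 224 params
        self.layer4 = nn.Conv2d(8, 16, 3, padding=1)   # 8*16*9 + 16 = 1168 params
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(16, num_classes)           # 16*20 + 20  = 340 params

    def forward(self, x):
        x = self.layer4(self.layer3(x))
        return self.fc(self.pool(x).flatten(1))


model = TinyBackbone()
counts = {s: count_trainable(set_trainable(model, s))
          for s in ("full_freeze", "partial", "full_finetune")}
print(counts)  # {'full_freeze': 340, 'partial': 1508, 'full_finetune': 1732}
```

One caveat worth keeping in mind with a real ResNet: freezing parameters does not stop BatchNorm layers from updating their running statistics while the model is in `train()` mode.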

In [None]:
# ResNet50 - Fully Frozen
set_seed(SEED)
model_resnet_frozen = get_resnet50(NUM_CLASSES, 'full_freeze').to(device)
count_parameters(model_resnet_frozen)
optimizer = optim.Adam(filter(lambda p: p.requires_grad, model_resnet_frozen.parameters()), lr=1e-3)
results['ResNet_Frozen'] = train_model(
    model_resnet_frozen, train_loader_resnet, val_loader_resnet,
    nn.CrossEntropyLoss(), optimizer, epochs=NUM_EPOCHS, patience=PATIENCE, device=device)
plot_curves(results['ResNet_Frozen'], 'ResNet50 Frozen')

In [None]:
# ResNet50 - Partial Unfreeze
set_seed(SEED)
model_resnet_partial = get_resnet50(NUM_CLASSES, 'partial').to(device)
count_parameters(model_resnet_partial)
optimizer = optim.Adam(filter(lambda p: p.requires_grad, model_resnet_partial.parameters()), lr=1e-4)
results['ResNet_Partial'] = train_model(
    model_resnet_partial, train_loader_resnet, val_loader_resnet,
    nn.CrossEntropyLoss(), optimizer, epochs=NUM_EPOCHS, patience=PATIENCE, device=device)
plot_curves(results['ResNet_Partial'], 'ResNet50 Partial Unfreeze')

In [None]:
# ResNet50 - Full Fine-tune
set_seed(SEED)
model_resnet_full = get_resnet50(NUM_CLASSES, 'full_finetune').to(device)
count_parameters(model_resnet_full)
optimizer = optim.Adam([
    {'params': [p for n, p in model_resnet_full.named_parameters() if 'fc' not in n], 'lr': 1e-5},
    {'params': model_resnet_full.fc.parameters(), 'lr': 1e-3},
])
results['ResNet_Full'] = train_model(
    model_resnet_full, train_loader_resnet, val_loader_resnet,
    nn.CrossEntropyLoss(), optimizer, epochs=NUM_EPOCHS, patience=PATIENCE, device=device)
plot_curves(results['ResNet_Full'], 'ResNet50 Full Fine-tune')

In [None]:
# Compare ResNet strategies
plot_comparison(
    [results['ResNet_Frozen'], results['ResNet_Partial'], results['ResNet_Full']],
    ['Frozen', 'Partial', 'Full Fine-tune'],
    title='ResNet50 Freezing Strategies')
for k in ['ResNet_Frozen', 'ResNet_Partial', 'ResNet_Full']:
    print(f"  {k:<20} val acc: {results[k]['best_val_acc']:.4f} (epoch {results[k]['best_epoch']})")

In [None]:
# Save ResNet models
for name, model in [('resnet_frozen', model_resnet_frozen),
                    ('resnet_partial', model_resnet_partial),
                    ('resnet_full', model_resnet_full)]:
    torch.save(model.state_dict(), os.path.join(WEIGHTS_DIR, f'{name}.pth'))
print(f"Saved ResNet models to {WEIGHTS_DIR}/")

### Analysis: ResNet50

Even the fully frozen ResNet50 (training only the last layer!) achieved strong results. This shows that the features learned on ImageNet are very transferable to logo classification — edges, shapes, textures, and color patterns are useful across domains.

Partial unfreezing (layer4 + head) gave an improvement because it allows the higher-level features to adapt to logos specifically. The lower layers (which detect basic edges and textures) don't need to change much.

Full fine-tuning with differential learning rates gave the best performance. Using a very small lr (1e-5) for the backbone prevents destroying the pretrained features, while the larger lr (1e-3) for the head lets it learn quickly.

Compared to my custom DeepCNN, ResNet50 performed significantly better. This makes sense — it was trained on 1.2 million images and has 50 layers of learned representations. My custom model with ~3.5M parameters can't compete with ResNet50's 23.5M parameters and massive pretraining data.

---
## Final Analysis & Comparison

In [None]:
# Cross-part comparison
best_custom = max(['DeepCNN_Adam', 'DeepCNN_SGD'], key=lambda k: results[k]['best_val_acc'])
best_resnet = max(['ResNet_Frozen', 'ResNet_Partial', 'ResNet_Full'], key=lambda k: results[k]['best_val_acc'])

plot_comparison(
    [results[best_custom], results['Pretrained'], results[best_resnet]],
    [f'Part 1: {best_custom}', 'Part 2: Pretrained', f'Part 3: {best_resnet}'],
    title='Best Model from Each Part')

print("Best model from each part:")
for label, key in [('Part 1 (scratch)', best_custom),
                   ('Part 2 (CIFAR PT)', 'Pretrained'),
                   ('Part 3 (ResNet)', best_resnet)]:
    h = results[key]
    print(f"  {label:<22} {key:<20} val acc={h['best_val_acc']:.4f} (ep {h['best_epoch']}, {h['training_time']:.0f}s)")

In [None]:
# Test set evaluation with best model
# Only models that are still in memory can be evaluated; the sweep models
# (lr, batch size, dropout, no-augmentation) were deleted after training.
model_map = {
    'SimpleCNN_Adam': model_simple_adam, 'SimpleCNN_SGD': model_simple_sgd,
    'DeepCNN_Adam': model_deep_adam, 'DeepCNN_SGD': model_deep_sgd,
    'DeepCNN_NoBN': model_deep_nobn, 'DeepCNN_WD': model_deep_wd,
    'Pretrained': model_pretrained,
    'ResNet_Frozen': model_resnet_frozen,
    'ResNet_Partial': model_resnet_partial,
    'ResNet_Full': model_resnet_full,
}

# Pick the best key among available models so the printed name always matches
# the model that is actually evaluated (no silent fallback to another model)
best_key = max(model_map, key=lambda k: results[k]['best_val_acc'])
print(f"Best overall model: {best_key} (val acc = {results[best_key]['best_val_acc']:.4f})\n")

is_resnet = 'ResNet' in best_key
test_loader = test_loader_resnet if is_resnet else test_loader_cnn

best_model = model_map[best_key]
best_model.eval()

all_preds, all_labels = [], []
with torch.no_grad():
    for images, labels in test_loader:
        outputs = best_model(images.to(device))
        all_preds.extend(outputs.argmax(1).cpu().numpy())
        all_labels.extend(labels.numpy())

test_acc = np.mean(np.array(all_preds) == np.array(all_labels))
print(f"Test Accuracy: {test_acc:.4f}")

# Confusion matrix
cm = confusion_matrix(all_labels, all_preds)
plt.figure(figsize=(14, 12))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names_dataset, yticklabels=class_names_dataset)
plt.xlabel('Predicted'); plt.ylabel('True')
plt.title(f'Confusion Matrix — {best_key} (Test Acc: {test_acc:.4f})')
plt.xticks(rotation=45, ha='right', fontsize=8)
plt.yticks(fontsize=8)
plt.tight_layout()
plt.show()

In [None]:
# Per-class metrics
print(classification_report(all_labels, all_preds, target_names=class_names_dataset))

report = classification_report(all_labels, all_preds,
                               target_names=class_names_dataset, output_dict=True)
f1_scores = {cls: report[cls]['f1-score'] for cls in class_names_dataset}

plt.figure(figsize=(14, 5))
plt.bar(f1_scores.keys(), f1_scores.values(), color='teal')
plt.axhline(y=np.mean(list(f1_scores.values())), color='red', linestyle='--',
            label=f"Mean F1: {np.mean(list(f1_scores.values())):.3f}")
plt.xticks(rotation=45, ha='right', fontsize=8)
plt.ylabel('F1-Score')
plt.title('Per-Class F1 Scores')
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
# Grad-CAM visualization
if 'ResNet' in best_key:
    target_layer = best_model.layer4[-1]
    grad_loader = test_loader_resnet
else:
    target_layer = best_model.block5 if hasattr(best_model, 'block5') else best_model.features[-1]
    grad_loader = test_loader_cnn

visualize_gradcam(best_model, target_layer, grad_loader,
                  class_names_dataset, device, n_samples=6)

### Analysis: Test Results

The confusion matrix shows that most classes are well-separated. Some clubs with visually similar logos (similar colors or circular shapes) get confused more often.
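
To read the most-confused pairs off the matrix instead of eyeballing the heatmap, a small helper like the one below can be used. It assumes rows are true classes and columns are predictions, which is the convention of `sklearn.metrics.confusion_matrix`. The function name and the toy 3-class matrix are made up for illustration:

```python
def most_confused_pairs(cm, class_names, top_k=3):
    """Return the largest off-diagonal entries as (count, true_class, predicted_class)."""
    pairs = []
    for i, row in enumerate(cm):
        for j, count in enumerate(row):
            if i != j and count > 0:
                pairs.append((int(count), class_names[i], class_names[j]))
    pairs.sort(key=lambda p: p[0], reverse=True)
    return pairs[:top_k]


# Toy confusion matrix (rows = true, columns = predicted), not real results
toy_cm = [
    [48, 2, 0],
    [5, 44, 1],
    [0, 0, 50],
]
toy_names = ["club_a", "club_b", "club_c"]
top_pairs = most_confused_pairs(toy_cm, toy_names)
print(top_pairs)  # [(5, 'club_b', 'club_a'), (2, 'club_a', 'club_b'), (1, 'club_b', 'club_c')]
```

In the notebook this could be called as `most_confused_pairs(cm, class_names_dataset)` after the confusion matrix cell.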

The per-class F1 scores show which logos are easiest and hardest to classify. Clubs with very distinctive logos (unique colors, unusual shapes) have the highest F1, while clubs with more generic-looking crests are harder.

The Grad-CAM heatmaps suggest that the model is looking at the actual logo features (crests, text, symbols) rather than background artifacts. This is a good sign that the model learned meaningful features.
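
For context, this is roughly how a Grad-CAM heatmap is computed. The project's `visualize_gradcam` lives in `visualization.py` and may differ; the `grad_cam` function below is my own minimal sketch of the general technique, tested here on a toy model rather than the trained networks:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def grad_cam(model, target_layer, image, class_idx=None):
    """Minimal Grad-CAM sketch: returns a [H, W] heatmap in [0, 1] and the class index.

    `image` is a single tensor of shape [3, H, W].
    """
    store = {}

    def hook(module, inputs, output):
        output.retain_grad()  # keep the gradient of this intermediate activation
        store["act"] = output

    handle = target_layer.register_forward_hook(hook)
    try:
        model.eval()
        model.zero_grad()
        # requires_grad on the input so gradients flow even through frozen layers
        x = image.unsqueeze(0).clone().requires_grad_(True)
        logits = model(x)
        if class_idx is None:
            class_idx = int(logits.argmax(dim=1))
        logits[0, class_idx].backward()
    finally:
        handle.remove()

    act = store["act"]                                  # [1, C, h, w]
    weights = act.grad.mean(dim=(2, 3), keepdim=True)   # channel importance
    cam = F.relu((weights * act).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=tuple(image.shape[1:]),
                        mode="bilinear", align_corners=False)
    cam = cam[0, 0].detach()
    cam = cam - cam.min()
    return cam / (cam.max() + 1e-8), class_idx


# Toy model just to exercise the function (not one of the trained networks)
torch.manual_seed(0)
toy_model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 5),
)
heatmap, predicted = grad_cam(toy_model, toy_model[1], torch.randn(3, 32, 32))
print(heatmap.shape, predicted)
```

The heatmap highlights regions whose activations increase the chosen class score. It is a qualitative check, so a few samples are suggestive but not proof that the model ignores the background.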

---
## Final Reflection

### What architectural choices mattered most

1. **Pretrained weights** — The biggest factor by far. ResNet50 with ImageNet weights outperformed everything else. This shows that features learned on a large diverse dataset transfer well even to a domain-specific task like logo classification.

2. **Network depth** — DeepCNN outperformed SimpleCNN, confirming that more layers with skip connections can learn richer representations. But the gap was much smaller than the gap between scratch and pretrained.

3. **Batch Normalization** — Essential for the deeper model. Removing it caused noticeable degradation.

4. **Regularization** — Dropout and data augmentation helped reduce overfitting, but their impact was secondary to architecture and pretraining.

### When transfer learning helped

- CIFAR-100 pretraining gave a modest boost — the low-level features transferred but the domain gap limited higher-level transfer.
- ResNet50 (ImageNet) gave the strongest results because ImageNet is massive and diverse.
- Even the frozen ResNet performed well, showing ImageNet features are broadly useful.

### What I would do differently with more time

- Try EfficientNet or Vision Transformers instead of ResNet50
- Use a logo-specific pretraining dataset (like FlickrLogos-32) instead of CIFAR-100
- Try CutMix/MixUp augmentation
- Increase resolution beyond 128x128 for custom CNNs
- Build an ensemble of the best custom CNN + ResNet50