# **VGG-19**

VGG-19 is one of the most influential CNNs. Its development has led to many further applications in computer vision, especially in pattern recognition. This notebook gives a brief summary of the model and explains why we use it in our own model.

Author:  
@jwpr-dpr  
15-04-2025

## **Introduction to VGG-19**

Welcome to this tutorial on **VGG-19**, one of the most influential convolutional neural networks in computer vision! Developed by the **Visual Geometry Group (VGG)** at the University of Oxford, VGG-19 gained popularity thanks to its elegant and simple architecture built entirely from 3x3 convolution layers and 2x2 max-pooling layers.

### What is VGG-19?

VGG-19 is a deep convolutional neural network that consists of:

- 19 weight layers: **16 convolutional layers** and **3 fully connected layers**
- Small filter sizes: All convolutional layers use 3x3 filters with stride 1 and padding to preserve spatial resolution.
- Simplicity in design: Repeated stacking of simple layers rather than using complex components.

The model was introduced in the 2014 paper:  
*Very Deep Convolutional Networks for Large-Scale Image Recognition* by Karen Simonyan and Andrew Zisserman.

---

### Why Use VGG-19?

-  **Transfer learning**: Excellent for feature extraction and fine-tuning on smaller datasets.
-  **Benchmarking**: A strong baseline model for many image classification tasks.
-  **Style transfer**: VGG-19 is widely used in neural style transfer applications.

Despite being more computationally expensive than newer architectures, VGG-19 remains a standard choice for complex and extensive visual applications.


## **Architecture Overview: VGG-19**

The **VGG-19 architecture** is a deep convolutional neural network composed of 19 layers with learnable weights:
- **16 convolutional layers**
- **3 fully connected (dense) layers**

It uses **very small filters (3×3)** and **max-pooling (2×2)** layers after blocks of convolutions. Each convolution layer uses **ReLU activation**, and the final classification is done via a **Softmax layer**.
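
As a rough check of the "19 weight layers" figure, the sketch below counts the weight layers and learnable parameters directly from the layer widths. The `VGG19_CFG` list and the `count_vgg19` helper are illustrative names, not part of any library; the arithmetic assumes 3×3 kernels with biases and the 224×224 ImageNet setup described in this notebook.

```python
# Illustrative config: conv output channels per layer, "M" marks a 2x2 max-pool
VGG19_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
             512, 512, 512, 512, "M", 512, 512, 512, 512, "M"]
FC_SIZES = [4096, 4096, 1000]  # the three fully connected layers


def count_vgg19(cfg=VGG19_CFG, fc_sizes=FC_SIZES, in_channels=3, final_hw=7):
    """Return (num_conv_layers, num_fc_layers, total_parameters)."""
    params, channels, n_conv = 0, in_channels, 0
    for item in cfg:
        if item == "M":
            continue  # pooling has no learnable weights
        params += 3 * 3 * channels * item + item  # 3x3 kernel weights + biases
        channels, n_conv = item, n_conv + 1

    fan_in = channels * final_hw * final_hw  # 512 * 7 * 7 = 25088 after flattening
    for fan_out in fc_sizes:
        params += fan_in * fan_out + fan_out
        fan_in = fan_out
    return n_conv, len(fc_sizes), params


n_conv, n_fc, total = count_vgg19()
print(n_conv, n_fc, total)  # 16 3 143667240
```

Most of the roughly 144 million parameters sit in the first fully connected layer (25088 × 4096), which is a large part of why VGG-19 is expensive to store and run.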

---

###  Layer Structure Summary

The input image size is typically **224×224×3** (RGB).

| Layer Block | Structure | Output Size |
|-------------|-----------|-------------|
| Input       | -         | 224×224×3   |
| Conv Block 1 | 2×(Conv3-64) + MaxPool | 112×112×64 |
| Conv Block 2 | 2×(Conv3-128) + MaxPool | 56×56×128 |
| Conv Block 3 | 4×(Conv3-256) + MaxPool | 28×28×256 |
| Conv Block 4 | 4×(Conv3-512) + MaxPool | 14×14×512 |
| Conv Block 5 | 4×(Conv3-512) + MaxPool | 7×7×512 |
| FC Layers    | Flatten → FC-4096 → FC-4096 → FC-1000 | 1000 classes (ImageNet) |

Note: Each convolution uses a 3×3 kernel, stride=1, and padding=1 (same padding).
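
The output sizes in the table follow from the standard convolution size formula, $\lfloor (n + 2p - k) / s \rfloor + 1$, for input size $n$, padding $p$, kernel $k$ and stride $s$. A small sketch (the helper names are illustrative) that walks a 224×224 input through the five blocks:

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Spatial size after a convolution (floor division, as PyTorch does)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Spatial size after a max-pool without padding."""
    return (size - kernel) // stride + 1


def block_shapes(input_size=224):
    """Return [(height_width, channels), ...] after each of the five conv blocks."""
    blocks = [(2, 64), (2, 128), (4, 256), (4, 512), (4, 512)]  # (num convs, channels)
    size, shapes = input_size, []
    for n_convs, channels in blocks:
        for _ in range(n_convs):
            size = conv_out(size)  # 3x3, stride 1, padding 1 keeps the size
        size = pool_out(size)      # 2x2, stride 2 halves it
        shapes.append((size, channels))
    return shapes


print(block_shapes())
# [(112, 64), (56, 128), (28, 256), (14, 512), (7, 512)]
```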

---

###  Architectural Highlights

-  **ReLU after every conv layer**
-  **MaxPooling after each block**
-  **No BatchNorm** in original version
-  **No shortcuts or attention layers**: very straightforward!
-  No residual connections and no depthwise separable convolutions, just plain convolutions and fully connected layers.
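
Because the design is just repeated conv/ReLU/pool units, the whole convolutional part can be generated from a short list. The sketch below is a simplified illustration of that idea: the `make_vgg19_features` helper and the `CFG` list are names made up here, and the weights are randomly initialized, not pretrained.

```python
import torch
from torch import nn

# Output channels per conv layer; "M" marks a 2x2 max-pool (illustrative config)
CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
       512, 512, 512, 512, "M", 512, 512, 512, 512, "M"]


def make_vgg19_features(cfg=CFG, in_channels=3):
    """Stack Conv3x3 + ReLU units and max-pools in the order given by cfg."""
    layers = []
    for item in cfg:
        if item == "M":
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers.append(nn.Conv2d(in_channels, item, kernel_size=3, padding=1))
            layers.append(nn.ReLU(inplace=True))
            in_channels = item
    return nn.Sequential(*layers)


features = make_vgg19_features()
num_convs = sum(isinstance(m, nn.Conv2d) for m in features)

# A small 64x64 input keeps the check fast; five pools reduce it to 2x2
with torch.no_grad():
    out = features(torch.randn(1, 3, 64, 64))
print(num_convs, tuple(out.shape))  # 16 (1, 512, 2, 2)
```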



## **Implementing VGG-19 in Code (with PyTorch)**

This step is quite straightforward, as we will be using the pretrained model available in `torchvision`.

In [1]:
import torch
import torchvision.models as models
#from torchsummary import summary  

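# `pretrained=True` is deprecated since torchvision 0.13; the `weights` argument replaces it
vgg19 = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)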

# Set to evaluation mode (we won't be training it)
vgg19.eval()

# Print the full architecture
print("Full VGG-19 Architecture:\n")
print(vgg19)



Full VGG-19 Architecture:

VGG(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace=True)
    (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU(inplace=True)
    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): ReLU(inplace=True)
    (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): ReLU(inplace=True)
    (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (13): ReLU(inplace=True)
    (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): ReLU(inplace=True)
    (16): Conv2d(256, 256, kernel_size=

### 🔍 Notes:
- We're loading **weights pretrained on ImageNet**.
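- The `vgg19` model has two main parts:
  - `features`: All convolutional and pooling layers
  - `classifier`: The three fully connected layers used for classification
- In torchvision, an adaptive average pooling layer (`avgpool`) sits between them and fixes the feature map at 7×7 before flattening.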


## **Applications of VGG-19 in Deep Learning**

Although VGG-19 is no longer state-of-the-art, it remains **widely used in practice** due to its simplicity and effectiveness. Below, we cover 4 core use cases:

---

### Feature Extraction

We can use VGG-19 as a **feature extractor** by removing the final classification head and using the convolutional layers to obtain embeddings.

This is useful for:
- Custom classifiers
- Similarity metrics
- Clustering images by content

In [None]:
import torch
import torchvision.models as models

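# Load pretrained VGG-19 and keep only the convolutional/pooling layers
vgg19 = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features
vgg19.eval()

# `image_tensor` is assumed to be a preprocessed batch of shape [B, 3, 224, 224]
# (resized and normalized with ImageNet statistics); a random tensor stands in here
image_tensor = torch.randn(1, 3, 224, 224)

# Pass image through VGG-19 to get feature map
with torch.no_grad():
    features = vgg19(image_tensor)  # shape: [B, 512, 7, 7] for 224x224 input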
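
For similarity metrics or clustering, the `[B, 512, 7, 7]` feature map is usually reduced to one vector per image. One common option, sketched below, is global average pooling followed by cosine similarity. Random tensors stand in for real VGG-19 feature maps here, and the `to_embedding` helper is an illustrative name.

```python
import torch
import torch.nn.functional as F


def to_embedding(feature_map):
    """Global-average-pool [B, C, H, W] to [B, C] and L2-normalize each row."""
    pooled = feature_map.mean(dim=(2, 3))
    return F.normalize(pooled, dim=1)


# Stand-ins for the outputs of vgg19.features on two batches of images
torch.manual_seed(0)
feats_a = torch.rand(4, 512, 7, 7)
feats_b = torch.rand(4, 512, 7, 7)

emb_a, emb_b = to_embedding(feats_a), to_embedding(feats_b)

# Pairwise cosine similarity: entry (i, j) compares image i of A with image j of B
similarity = emb_a @ emb_b.T
print(emb_a.shape, similarity.shape)  # torch.Size([4, 512]) torch.Size([4, 4])
```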

### Transfer Learning 

You can fine-tune VGG-19 on your own dataset by:
* Freezing all feature layers
* Replacing the classifier head and training the new one

In [None]:
from torch import nn

# Load full model
model = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)

# Freeze feature extractor
for param in model.features.parameters():
    param.requires_grad = False

# Replace classifier for your custom task
model.classifier = nn.Sequential(
    nn.Linear(25088, 512),  # 25088 = 512 * 7 * 7 (flattened feature map)
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(512, 10),  # E.g. for CIFAR-10
    nn.LogSoftmax(dim=1)  # outputs log-probabilities; pair with nn.NLLLoss
)
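
With the features frozen, only the new head needs an optimizer. The sketch below shows one training step under a few assumptions: a tiny stand-in network (`TinyNet`, a hypothetical class with the same `features`/`classifier` layout) replaces VGG-19 so that it runs quickly, the batch is random data, and `nn.NLLLoss` is used because the head ends in `LogSoftmax`.

```python
import torch
from torch import nn


class TinyNet(nn.Module):
    """Hypothetical stand-in with the same features/classifier split as VGG-19."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(7)
        )
        self.classifier = nn.Sequential(
            nn.Linear(8 * 7 * 7, num_classes), nn.LogSoftmax(dim=1)
        )

    def forward(self, x):
        return self.classifier(torch.flatten(self.features(x), 1))


torch.manual_seed(0)
model = TinyNet()
for param in model.features.parameters():
    param.requires_grad = False  # freeze the feature extractor

# Hand only the trainable (classifier) parameters to the optimizer
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)
criterion = nn.NLLLoss()  # expects log-probabilities, i.e. LogSoftmax outputs

images = torch.randn(4, 3, 32, 32)   # placeholder batch
labels = torch.randint(0, 10, (4,))  # placeholder class indices

frozen_before = model.features[0].weight.clone()
head_before = model.classifier[0].weight.clone()

model.train()
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```

After the step, the frozen convolution weights are unchanged while the classifier weights have moved. Using `nn.CrossEntropyLoss` on top of a `LogSoftmax` head would apply log-softmax a second time, which is redundant, so either keep `NLLLoss` or drop the final `LogSoftmax` layer.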


### Image Classification 

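VGG-19 was originally trained on ImageNet and can be retrained on diverse image datasets. The cell below reuses `model` from the previous section; its new classifier head must be trained before the predictions are meaningful.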

In [None]:
from torchvision import transforms
from PIL import Image

# Preprocess image
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225])
])

img = Image.open("your_image.jpg").convert("RGB")  # ensure 3 channels
img_tensor = transform(img).unsqueeze(0)  # Add batch dimension

# Predict (no gradients needed at inference time)
model.eval()
with torch.no_grad():
    output = model(img_tensor)
prediction = output.argmax(dim=1)
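
`argmax` only returns the index of the top class. To inspect confidence scores, the outputs can be converted to probabilities first. A minimal sketch with random logits standing in for a model output: with the original VGG-19 classifier the output is raw logits, so `softmax` applies; with the `LogSoftmax` head defined earlier, `output.exp()` would be used instead.

```python
import torch

torch.manual_seed(0)
logits = torch.randn(1, 1000)  # placeholder for model(img_tensor) on ImageNet classes

probs = torch.softmax(logits, dim=1)       # each row sums to 1
top_probs, top_idx = probs.topk(5, dim=1)  # five most likely classes, sorted

for p, idx in zip(top_probs[0].tolist(), top_idx[0].tolist()):
    print(f"class {idx}: {p:.3f}")
```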


### Style Transfer (Iconic Use Case!)
VGG-19 is the backbone of neural style transfer (NST). Rendering images in the style of a particular artist has become very popular lately (the wave of Studio Ghibli imitations is the best-known example), and NST is the classic formulation of that idea, where:

* One image provides content
* Another provides style
* The model mixes both by optimizing pixel values

Core idea:
* Use content loss from a deeper layer (e.g., conv4_2)
* Use style loss from several layers, starting with the shallow ones (e.g., conv1_1, conv2_1, etc.)
* This is often implemented using `vgg19.features` and computing Gram matrices of the feature maps.

In [None]:
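# Example: collecting intermediate activations for content/style loss
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()

# You can extract intermediate layers like this:
def get_features(x, model, layers):
    """Return {layer_name: activation} for the module indices listed in `layers`."""
    features = {}
    for name, layer in model.named_children():
        x = layer(x)
        if name in layers:
            # Key by the readable name (e.g. 'conv1_1') rather than the index.
            # Note: the in-place ReLU that follows each conv overwrites this tensor,
            # so the stored value ends up being the post-ReLU activation.
            features[layers[name]] = x
    return features

# Indices into vgg19.features mapped to the usual layer names
layers = {'0': 'conv1_1', '5': 'conv2_1', '10': 'conv3_1', '19': 'conv4_1'}
# `input_img` is assumed to be a preprocessed tensor of shape [B, 3, H, W]
features = get_features(input_img, vgg, layers)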
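
The Gram matrices mentioned above capture which feature channels fire together, regardless of where in the image they do so. Below is a minimal sketch of the Gram matrix and a style loss built on it. The normalization by C·H·W is one common convention (an assumption here, implementations differ), and random tensors stand in for VGG-19 activations.

```python
import torch
import torch.nn.functional as F


def gram_matrix(feat):
    """Channel-by-channel correlations of a [B, C, H, W] feature map -> [B, C, C]."""
    b, c, h, w = feat.shape
    flat = feat.reshape(b, c, h * w)
    return flat @ flat.transpose(1, 2) / (c * h * w)


def style_loss(generated_feats, style_feats):
    """Sum of MSE between Gram matrices over all layers in style_feats."""
    return sum(
        F.mse_loss(gram_matrix(generated_feats[name]), gram_matrix(style_feats[name]))
        for name in style_feats
    )


torch.manual_seed(0)
# Placeholder activations keyed by layer name
style_feats = {"conv1_1": torch.rand(1, 64, 32, 32), "conv2_1": torch.rand(1, 128, 16, 16)}
generated_feats = {name: torch.rand_like(f) for name, f in style_feats.items()}

gram = gram_matrix(style_feats["conv1_1"])
loss = style_loss(generated_feats, style_feats)
print(gram.shape, loss.item() > 0)
```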


## **Perceptual Loss with VGG-19**

Perceptual Loss (also called **Feature Reconstruction Loss**) measures the difference between high-level feature representations of two images instead of raw pixel-wise differences.

This is particularly useful in:
- Super-resolution (SRGAN)
- Style transfer
- Inpainting / Image generation
- Denoising

---

### 🧠 Key Idea

Use a **pretrained model like VGG-19**, and extract intermediate feature maps (e.g., from `conv3_3` or `conv4_2`), then compute L1 or L2 loss between those maps for two images:

$$
\mathcal{L}_{\text{perceptual}}(x, \hat{x}) = \| \phi_l(x) - \phi_l(\hat{x}) \|_2^2
$$

Where:
- $\phi_l$ is the activation at layer $l$
- $x$ is the ground truth image
- $\hat{x}$ is the generated image
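
In practice the squared distance is usually averaged over the feature map so that the loss scale does not depend on the size of the chosen layer. Assuming a feature map of shape $C_l \times H_l \times W_l$ at layer $l$ (symbols introduced here for illustration), the normalized form reads:

$$
\mathcal{L}_{\text{perceptual}}(x, \hat{x}) = \frac{1}{C_l H_l W_l} \| \phi_l(x) - \phi_l(\hat{x}) \|_2^2
$$

This is what a mean-reduced MSE (such as `nn.MSELoss` with its default settings) computes for a single image; over a batch it additionally averages across images.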

---

### 🛠️ Implementing Perceptual Loss with PyTorch

In [None]:
import torch
import torch.nn as nn
import torchvision.models as models

class VGGPerceptualLoss(nn.Module):
    def __init__(self, layer='conv4_2', resize=True):
        super(VGGPerceptualLoss, self).__init__()
        self.vgg_layers = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
        
        # Freeze the VGG weights
        for param in self.vgg_layers.parameters():
            param.requires_grad = False
        
        # Choose layer index corresponding to desired perceptual depth
        self.layer_name_mapping = {
            'conv1_1': 0,
            'conv1_2': 2,
            'conv2_1': 5,
            'conv2_2': 7,
            'conv3_1': 10,
            'conv3_2': 12,
            'conv3_3': 14,
            'conv3_4': 16,
            'conv4_1': 19,
            'conv4_2': 21,
            'conv4_3': 23,
            'conv4_4': 25,
            'conv5_1': 28
        }

        self.target_layer = self.layer_name_mapping[layer]
        self.resize = resize
        self.criterion = nn.MSELoss()

    def forward(self, x, y):
        # Resize if needed
        if self.resize:
            x = nn.functional.interpolate(x, size=(224, 224), mode='bilinear', align_corners=False)
            y = nn.functional.interpolate(y, size=(224, 224), mode='bilinear', align_corners=False)
        
        # Normalize to ImageNet stats (inputs are assumed to be RGB in [0, 1])
        mean = torch.tensor([0.485, 0.456, 0.406]).to(x.device).view(1, 3, 1, 1)
        std = torch.tensor([0.229, 0.224, 0.225]).to(x.device).view(1, 3, 1, 1)
        x = (x - mean) / std
        y = (y - mean) / std

        # Forward until target layer
        for i, layer in enumerate(self.vgg_layers):
            x = layer(x)
            y = layer(y)
            if i == self.target_layer:
                break

        # Compute perceptual loss
        return self.criterion(x, y)
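
In many pipelines the perceptual loss is added to a pixel-wise loss with a weighting factor. The sketch below illustrates the combination under stated assumptions: `FeatureLoss` is a hypothetical, simplified stand-in for `VGGPerceptualLoss` that wraps a small randomly initialized conv stack (so it runs without downloading weights), and the weight `0.1` is an arbitrary example value.

```python
import torch
from torch import nn


class FeatureLoss(nn.Module):
    """Hypothetical stand-in: MSE between frozen-extractor features of two images."""

    def __init__(self, extractor):
        super().__init__()
        self.extractor = extractor.eval()
        for param in self.extractor.parameters():
            param.requires_grad = False  # the extractor is never trained
        self.criterion = nn.MSELoss()

    def forward(self, generated, target):
        return self.criterion(self.extractor(generated), self.extractor(target))


torch.manual_seed(0)
# Small random conv stack in place of vgg19.features (assumption for a fast demo)
extractor = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 16, 3, padding=1)
)
perceptual = FeatureLoss(extractor)
pixel = nn.L1Loss()
perceptual_weight = 0.1  # example value; tuned per task in practice

target = torch.rand(2, 3, 32, 32)
generated = torch.rand(2, 3, 32, 32, requires_grad=True)  # e.g. a generator output

total = pixel(generated, target) + perceptual_weight * perceptual(generated, target)
total.backward()  # gradients reach the generated image, not the frozen extractor
```

With the class defined above, `VGGPerceptualLoss(layer='conv4_2')` would take the place of `FeatureLoss(extractor)`.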