# VGG
[Very Deep Convolutional Networks for Large-Scale Image Recognition](https://arxiv.org/pdf/1409.1556.pdf)

"In this work we investigate the effect of the convolutional network depth on its
accuracy in the large-scale image recognition setting." 

...

"a significant improvement
on the prior-art configurations can be achieved by pushing the depth to 16–19
weight layers"

...

"To this end, we fix other parameters of the architecture, and steadily increase the
depth of the network by adding more convolutional layers, which is feasible due to the use of very
small (3 × 3) convolution filters in all layers."
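
The argument for small filters in the quote above can be checked with a little arithmetic. The sketch below assumes stride-1 convolutions with the same number of channels in and out, and ignores biases; the helper names are hypothetical, chosen only for illustration. It shows that three stacked 3 × 3 layers cover the same 7 × 7 receptive field as a single 7 × 7 layer while using fewer weights.

```python
def stacked_receptive_field(num_layers, kernel_size=3):
    # Each additional stride-1 conv layer widens the receptive field
    # by (kernel_size - 1) pixels.
    return 1 + num_layers * (kernel_size - 1)


def conv_weights(kernel_size, channels):
    # Weights of one conv layer with `channels` inputs and outputs (no bias).
    return kernel_size * kernel_size * channels * channels


channels = 512  # illustrative width, matching the deepest VGG blocks
stack = 3 * conv_weights(3, channels)   # three 3x3 layers: 27 * C^2
single = conv_weights(7, channels)      # one 7x7 layer:    49 * C^2

print(stacked_receptive_field(3))  # 7
print(stack, single)               # 7077888 12845056
```

The stack also applies a non-linearity after every layer, so it gains depth as well as saving parameters.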

---
Runner-up at ImageNet in 2014 (behind the complex GoogLeNet), VGG sets out to explore the limits of what convolutional network architectures can do.

Several architectures came before it. AlexNet, in 2012, reached a top-5 error of 15.3% by bringing back convolutions, a technique Yann LeCun had applied as early as 1989 in LeNet to read handwritten digits from 0 to 9.

ZFNet, in 2013, refined that architecture and reached a top-5 error of 11.2%.

Inception v1 (GoogLeNet), in 2014, reached a top-5 error of 6.67% with an extremely complex, parallelized 22-layer architecture and a highly specialized, carefully tuned training procedure.

Finally we arrive at VGG, proposed in the same year, 2014: much simpler, with a top-5 error of 7.3%, trained on 4 GPUs for 2–3 weeks.
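
The top-5 error quoted for each network counts a prediction as correct when the true class is among the five highest-scoring classes. A minimal sketch of that metric in plain Python follows; the function name and the toy scores are illustrative assumptions, not part of the original notebook.

```python
def top5_error(scores_per_image, true_labels):
    # Fraction of images whose true label is NOT among the 5 best-scored classes.
    misses = 0
    for scores, label in zip(scores_per_image, true_labels):
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if label not in ranked[:5]:
            misses += 1
    return misses / len(true_labels)


# Toy example with 6 classes: class 5 has the lowest score in both rows.
scores = [
    [0.30, 0.25, 0.20, 0.10, 0.10, 0.05],
    [0.30, 0.25, 0.20, 0.10, 0.10, 0.05],
]
labels = [5, 0]  # the first image is a miss, the second a hit
print(top5_error(scores, labels))  # 0.5
```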
 

We will build the 16-layer version, shown in column D:

![image](https://user-images.githubusercontent.com/56324869/163493084-e6b9567e-a971-4cd5-831f-f501a95ecb35.png)


In [1]:
!nvidia-smi

Fri Apr 15 00:52:20 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   74C    P8    80W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [16]:
import copy
import os
import time

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

import torchvision.models as models
import torchvision.transforms as transforms
from torchvision.datasets import CIFAR10

In [None]:
batch_size = 64

# Resize CIFAR-10's 32x32 images to the 224x224 input VGG expects,
# then normalize with the usual ImageNet channel means and standard deviations.
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Resize((224,224)),
     transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))])

image_datasets = {x: CIFAR10(root='./data', train=(x == "train"), download=True, transform=transform) for x in ['train','val']}
dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=batch_size, shuffle=True, num_workers=2) for x in ['train', 'val']}
dataset_sizes = {x: len(image_datasets[x]) for x in ['train', 'val']}
class_names = image_datasets['train'].classes
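
As a quick sanity check on the loaders, the number of batches per epoch follows directly from the split sizes. The sketch below assumes the standard CIFAR-10 split (50,000 training and 10,000 test images) and the `DataLoader` default of keeping the last, smaller batch; it does not touch the dataset itself.

```python
import math

batch_size = 64
split_sizes = {"train": 50_000, "val": 10_000}  # standard CIFAR-10 split sizes

# With drop_last=False (the DataLoader default), a partial final batch is kept.
batches = {name: math.ceil(n / batch_size) for name, n in split_sizes.items()}
last_batch = {name: n % batch_size or batch_size for name, n in split_sizes.items()}

print(batches)     # {'train': 782, 'val': 157}
print(last_batch)  # {'train': 16, 'val': 16}
```

These are the values `len(dataloaders["train"])` and `len(dataloaders["val"])` should return, which matters later because the training loss is averaged over the number of batches.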

In [12]:
class VGG16(nn.Module):

  def __init__(self):
      super().__init__()
      
      #2 conv3-64
      self.conv1_1 = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
      self.conv1_2 = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, padding=1)

      #2 conv3-128
      self.conv2_1 = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1)
      self.conv2_2 = nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, padding=1)
      
      #3 conv3-256
      self.conv3_1 = nn.Conv2d(in_channels=128, out_channels=256, kernel_size=3, padding=1)
      self.conv3_2 = nn.Conv2d(in_channels=256, out_channels=256, kernel_size=3, padding=1)
      self.conv3_3 = nn.Conv2d(in_channels=256, out_channels=256, kernel_size=3, padding=1)

      #3 conv3-512
      self.conv4_1 = nn.Conv2d(in_channels=256, out_channels=512, kernel_size=3, padding=1)
      self.conv4_2 = nn.Conv2d(in_channels=512, out_channels=512, kernel_size=3, padding=1)
      self.conv4_3 = nn.Conv2d(in_channels=512, out_channels=512, kernel_size=3, padding=1)

      #3 conv3-512
      self.conv5_1 = nn.Conv2d(in_channels=512, out_channels=512, kernel_size=3, padding=1)
      self.conv5_2 = nn.Conv2d(in_channels=512, out_channels=512, kernel_size=3, padding=1)
      self.conv5_3 = nn.Conv2d(in_channels=512, out_channels=512, kernel_size=3, padding=1)

      self.maxpool = nn.MaxPool2d(kernel_size=2, stride=2)

      self.fc1 = nn.Linear(25088, 4096)
      self.fc2 = nn.Linear(4096, 4096)
      self.fc3 = nn.Linear(4096, 10)
    
  def forward(self, x):
      x = F.relu(self.conv1_1(x))
      x = F.relu(self.conv1_2(x))

      x = self.maxpool(x)

      x = F.relu(self.conv2_1(x))
      x = F.relu(self.conv2_2(x))

      x = self.maxpool(x)

      x = F.relu(self.conv3_1(x))
      x = F.relu(self.conv3_2(x))
      x = F.relu(self.conv3_3(x))

      x = self.maxpool(x)

      x = F.relu(self.conv4_1(x))
      x = F.relu(self.conv4_2(x))
      x = F.relu(self.conv4_3(x))

      x = self.maxpool(x)

      x = F.relu(self.conv5_1(x))
      x = F.relu(self.conv5_2(x))
      x = F.relu(self.conv5_3(x))

      x = self.maxpool(x)

      x = torch.flatten(x, 1) # flatten all dims except batch
      
      x = F.relu(self.fc1(x))
      x = F.dropout(x, 0.5, training=self.training) # dropout combats overfitting; disabled in eval mode
      x = F.relu(self.fc2(x))
      x = F.dropout(x, 0.5, training=self.training)
      x = self.fc3(x)
      return x
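
The input size of `fc1`, 25088, comes from the shape of the last feature map. The sketch below traces the spatial size through the five 2 × 2 max-pooling layers, assuming the padded 3 × 3 convolutions leave height and width unchanged; the helper name is hypothetical.

```python
def feature_map_size(height, width, num_pools=5):
    # MaxPool2d(kernel_size=2, stride=2) halves each side, rounding down.
    for _ in range(num_pools):
        height, width = height // 2, width // 2
    return height, width


h, w = feature_map_size(224, 224)
print(h, w)         # 7 7
print(512 * h * w)  # 25088, the in_features of fc1

# A raw 32x32 CIFAR-10 image would shrink to 1x1, giving only 512 features,
# which is why the images are resized to 224x224 for this classifier head.
print(feature_map_size(32, 32))  # (1, 1)
```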
     

In [4]:
# Given enough time (a lot of it on a Colab GPU), training would be done with:
epoches = 50
device = torch.device("cuda")
model = VGG16()
model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=3e-4)

best_model_wts = copy.deepcopy(model.state_dict())
best_acc = 0.0
start = time.time()

for epoch in range(epoches):  

    print(f"\nepoch: {epoch} / {(epoches-1)}")
    print("----------")

    epoch_loss = 0
    model.train()

    for x, y in dataloaders["train"]:
      x, y = x.to(device), y.to(device)
      optimizer.zero_grad()
      predictions = model(x)
      loss = criterion(predictions, y)
      loss.backward()
      optimizer.step()
      epoch_loss += loss.item()
    loss = epoch_loss / len(dataloaders["train"])
    print(f"epoch {epoch} training Loss: {loss:.4f}")

    with torch.no_grad():
      model.eval()
      correct = 0
      samples = 0
      for x, y in dataloaders["val"]:
        x, y = x.to(device), y.to(device)
        predictions = model(x)
        _, predictions = predictions.max(1)
        correct += (predictions == y).sum()
        samples += predictions.size(0)

      val_acc = correct/samples
      print(f"val accuracy: {val_acc}")

      if val_acc > best_acc:
        best_acc = val_acc
        best_model_wts = copy.deepcopy(model.state_dict())
  
end = time.time()
model.load_state_dict(best_model_wts)
torch.save(model.state_dict(),"vgg16.pth")
print(f"Training time: {(end-start):.2f} s")



epoch: 0 / 49
----------
epoch 0 training Loss: 1.9267
val accuracy: 0.4235000014305115

epoch: 1 / 49
----------


KeyboardInterrupt: ignored

Let us now compare our implementation with the one in torchvision:

In [13]:
from torchsummary import summary
model = VGG16()
model.to(device)
summary(model,(3,224,224))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1         [-1, 64, 224, 224]           1,792
            Conv2d-2         [-1, 64, 224, 224]          36,928
         MaxPool2d-3         [-1, 64, 112, 112]               0
            Conv2d-4        [-1, 128, 112, 112]          73,856
            Conv2d-5        [-1, 128, 112, 112]         147,584
         MaxPool2d-6          [-1, 128, 56, 56]               0
            Conv2d-7          [-1, 256, 56, 56]         295,168
            Conv2d-8          [-1, 256, 56, 56]         590,080
            Conv2d-9          [-1, 256, 56, 56]         590,080
        MaxPool2d-10          [-1, 256, 28, 28]               0
           Conv2d-11          [-1, 512, 28, 28]       1,180,160
           Conv2d-12          [-1, 512, 28, 28]       2,359,808
           Conv2d-13          [-1, 512, 28, 28]       2,359,808
        MaxPool2d-14          [-1, 512,

In [15]:
torch_vgg = models.vgg16()
torch_vgg.classifier[6] = nn.Linear(4096, 10) # Replace the final layer's 1000 ImageNet outputs with 10 for CIFAR-10
torch_vgg.to(device)
summary(torch_vgg, (3,224,224))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1         [-1, 64, 224, 224]           1,792
              ReLU-2         [-1, 64, 224, 224]               0
            Conv2d-3         [-1, 64, 224, 224]          36,928
              ReLU-4         [-1, 64, 224, 224]               0
         MaxPool2d-5         [-1, 64, 112, 112]               0
            Conv2d-6        [-1, 128, 112, 112]          73,856
              ReLU-7        [-1, 128, 112, 112]               0
            Conv2d-8        [-1, 128, 112, 112]         147,584
              ReLU-9        [-1, 128, 112, 112]               0
        MaxPool2d-10          [-1, 128, 56, 56]               0
           Conv2d-11          [-1, 256, 56, 56]         295,168
             ReLU-12          [-1, 256, 56, 56]               0
           Conv2d-13          [-1, 256, 56, 56]         590,080
             ReLU-14          [-1, 256,

Note that the total number of parameters in the official TorchVision implementation matches ours, 134,301,514, so the two define the same layers with the same sizes. This model reaches 93.42% accuracy on CIFAR-10 and 74.4% top-1 accuracy on ImageNet.
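
The total of 134,301,514 can also be reproduced by hand from the layer sizes used in the `VGG16` class above. The sketch below is plain Python arithmetic, not a call into PyTorch; the constant and function names are illustrative.

```python
# Output channels of the 13 conv layers (column D), in order.
VGG16_CONV_CHANNELS = [64, 64, 128, 128, 256, 256, 256,
                       512, 512, 512, 512, 512, 512]
# (in_features, out_features) of the three fully connected layers,
# with the last one sized for the 10 CIFAR-10 classes.
FC_SIZES = [(25088, 4096), (4096, 4096), (4096, 10)]


def count_parameters():
    total = 0
    in_channels = 3  # RGB input
    for out_channels in VGG16_CONV_CHANNELS:
        # 3x3 kernel weights plus one bias per output channel.
        total += 3 * 3 * in_channels * out_channels + out_channels
        in_channels = out_channels
    for in_features, out_features in FC_SIZES:
        total += in_features * out_features + out_features
    return total


print(count_parameters())  # 134301514
```

Most of the parameters sit in the first fully connected layer (25088 × 4096 weights), not in the convolutions.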