<a href="https://colab.research.google.com/github/ArthurCBx/PyTorch-DeepLearning-Udemy/blob/main/08_paper_replicating.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 08. Milestone Project 2: PyTorch Paper Replicating

The goal of machine learning research paper replicating is to turn an ML research paper into usable code.

In this notebook, we're going to replicate the Vision Transformer (ViT) architecture/paper with PyTorch.

## 0. Get setup

Let's import the code we've previously written plus the required libraries.

In [None]:
import torch
import torchvision


In [None]:
!pip install torchinfo

In [None]:
try:
  from going_modular.going_modular import data_setup, engine
  from helper_functions import download_data, set_seeds, plot_loss_curves
except ImportError:
  # Download the course scripts if they aren't available locally
  !git clone https://github.com/mrdbourke/pytorch-deep-learning
  !mv pytorch-deep-learning/going_modular .
  !mv pytorch-deep-learning/helper_functions.py .
  !rm -rf pytorch-deep-learning
  from going_modular.going_modular import data_setup, engine
  from helper_functions import download_data, set_seeds, plot_loss_curves


In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

## 1. Get data

The whole goal of what we're trying to do is to replicate the ViT architecture for our FoodVision Mini problem.

To do that, we need some data.

Namely, the pizza, steak and sushi images we've been using so far.

In [None]:
image_path = download_data(source="https://github.com/mrdbourke/pytorch-deep-learning/raw/main/data/pizza_steak_sushi.zip",
                           destination="pizza_steak_sushi")

In [None]:
train_dir = image_path / "train"
test_dir = image_path / "test"

## 2. Create Datasets and DataLoaders

In [None]:
from going_modular.going_modular import data_setup
from torchvision import transforms
# Create image size
IMG_SIZE = 224 # Comes from Table 3 of the ViT paper

# Create transforms pipeline
manual_transforms = transforms.Compose([
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    transforms.ToTensor(),
])

print(f"Manually created transforms: {manual_transforms}")

In [None]:
# Create a batch size of 32 (the paper uses 4096 but this may be too big for our smaller problem)
BATCH_SIZE = 32

# Create DataLoaders
train_dataloader, test_dataloader, class_names = data_setup.create_dataloaders(train_dir,
                                                                               test_dir,
                                                                               manual_transforms,
                                                                               BATCH_SIZE)

len(train_dataloader), len(test_dataloader), class_names

### 2.3 Visualize a single image

In [None]:
from PIL import Image
import matplotlib.pyplot as plt

# Get a batch of images
image_batch, label_batch = next(iter(train_dataloader))
print(f"Image batch shape: {image_batch.shape}")
print(f"Label batch shape: {label_batch.shape}")

# Get a single image and label from the batch
image, label = image_batch[0], label_batch[0]

plt.imshow(image.permute(1,2,0))
plt.title(class_names[label])
plt.axis(False);

## 3. Replicating ViT: Overview

Looking at a whole machine learning research paper can be intimidating.

So, in order to make it more understandable, we can break it down into smaller pieces:

* **Inputs** - What goes into the model? (in our case, image tensors)
* **Outputs** - What comes out of the model/layer/block? (in our case, we want the model to output image classification labels)
* **Layers** - Takes an input, manipulates it with a function (for example, self-attention).
* **Blocks** - A collection of layers.
* **Model (or architecture)** - A collection of blocks.

### 3.1 ViT overview: pieces of the puzzle

* Figure 1: Visual overview of the architecture
* Four equations: math equations which define the functions of each layer/block
* Table 1/3: different hyperparameters for the architecture/training.
* Text

### Figure 1
![](https://github.com/mrdbourke/pytorch-deep-learning/raw/main/images/08-vit-paper-figure-1-architecture-overview.png)

* Embedding - learnable representation (start with random numbers and improve over time)

### Four equations

![](https://github.com/mrdbourke/pytorch-deep-learning/raw/main/images/08-vit-paper-four-equations.png)

#### Section 3.1:

**Equation 1:**
An overview of the model is depicted in Figure 1. The standard Transformer receives as input a 1D sequence of token embeddings. To handle 2D images, we reshape the image $\mathbf{x} \in \mathbb{R}^{H \times W \times C}$ into a sequence of flattened 2D patches $\mathbf{x}_p \in \mathbb{R}^{N \times\left(P^2 \cdot C\right)}$, where $(H, W)$ is the resolution of the original image, $C$ is the number of channels, $(P, P)$ is the resolution of each image patch, and $N=H W / P^2$ is the resulting number of patches, which also serves as the effective input sequence length for the Transformer. The Transformer uses constant latent vector size $D$ through all of its layers, so we flatten the patches and map to $D$ dimensions with a trainable linear projection (Eq. 1). We refer to the output of this projection as the patch embeddings.

**Equation 1:**
Position embeddings are added to the patch embeddings to retain positional information. We use standard learnable 1D position embeddings, since we have not observed significant performance gains from using more advanced 2D-aware position embeddings (Appendix D.4). The resulting sequence of embedding vectors serves as input to the encoder.

In pseudocode:

```python
x_input = [class_token, image_patch_1,...,image_patch_N] + [class_token_pos, image_patch_1_pos,...]
```
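To make the reshaping in Equation 1 concrete, here is a minimal sketch. It assumes a random tensor in place of a real image and the ViT-Base values ($P=16$, $D=768$). The variable names are illustrative and not taken from the paper.

```python
import torch
from torch import nn

# Assumed example values (ViT-Base with 16x16 patches on a 224x224 RGB image)
H, W, C, P, D = 224, 224, 3, 16, 768
N = (H * W) // P**2  # number of patches

image = torch.randn(C, H, W)  # stand-in for a real image, channels first

# Split height and width into (patches_per_side, P) and flatten each patch
patches = image.reshape(C, H // P, P, W // P, P)
patches = patches.permute(1, 3, 2, 4, 0)   # (H/P, W/P, P, P, C)
x_p = patches.reshape(N, P * P * C)        # (N, P^2 * C)

# Trainable linear projection maps each flattened patch to D dimensions
projection = nn.Linear(in_features=P * P * C, out_features=D)
patch_embeddings = projection(x_p)         # (N, D)

# Prepend a class token, then add position embeddings (both learnable)
class_token = nn.Parameter(torch.randn(1, D))
position_embedding = nn.Parameter(torch.randn(N + 1, D))
z_0 = torch.cat((class_token, patch_embeddings), dim=0) + position_embedding

print(x_p.shape, z_0.shape)
```

The sequence length grows from $N$ to $N+1$ because of the prepended class token.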

**Equations 2&3:**
The Transformer encoder (Vaswani et al., 2017) consists of alternating layers of multiheaded self-attention (MSA, see Appendix A) and MLP blocks (Eq. 2, 3). Layernorm (LN) is applied before every block, and residual connections after every block (Wang et al., 2019; Baevski \& Auli, 2019).

In pseudocode:

```python
# Equation 2
x_output_MSA_block = MSA_layer(LN_layer(x_input)) + x_input

# Equation 3
x_output_MLP_block = MLP_layer(LN_layer(x_output_MSA_block)) + x_output_MSA_block
```

**Equation 4:**
Similar to BERT's [class] token, we prepend a learnable embedding to the sequence of embedded patches $\left(\mathbf{z}_0^0=\mathbf{x}_{\text {class }}\right)$, whose state at the output of the Transformer encoder $\left(\mathbf{z}_L^0\right)$ serves as the image representation $y$ (Eq. 4). Both during pre-training and fine-tuning, a classification head is attached to $\mathbf{z}_L^0$. The classification head is implemented by a MLP with one hidden layer at pre-training time and by a single linear layer at fine-tuning time.

* MLP = multilayer perceptron = a neural network with X number of layers
* MLP = one hidden layer at pre-training time
* MLP = single linear layer at fine-tuning time

In pseudocode:
```python
# Equation 4
y = Linear_layer(LN_layer(x_output_MLP_block[0])) # [0] = the class token at index 0 of the sequence
```

### Table 1

![](https://github.com/mrdbourke/pytorch-deep-learning/raw/main/images/08-vit-paper-table-1.png)

* ViT-Base, ViT-Large and ViT-Huge are all different sizes of the same model architecture.
* Layers - the number of Transformer encoder layers.
* Hidden size $D$ - the embedding size throughout the architecture (the size of the vector each patch is turned into).
* MLP size - the number of hidden units/neurons in the MLP.
* Heads - the number of heads in each multi-head self-attention layer.

## 4. Equation 1: Splitting data into patches and creating the class, position and patch embeddings

Layers = input -> function -> output

What's the input shape?

What's the output shape?

* Input shape: (224,224,3) -> single image -> (height, width, color channels)
* Output shape: ?

### 4.1 Calculate input and output shape by hand

* Input shape: $H\times{W}\times{C}$
* Output shape: $N \times\left(P^2 \cdot C\right)$
* H = height
* W = width
* C = color channels
* P = patch size
* N = number of patches = (height * width) / P²
* D = constant latent vector size = embedding dimension (see Table 1)

In [None]:
# Create example values
height = 224
width = 224
color_channels = 3
patch_size = 16

# Calculate the number of patches
number_of_patches = int((height * width) / patch_size**2)
number_of_patches

In [None]:
# Input shape
embedding_layer_input_shape = (height, width, color_channels)

# Output_shape
embedding_layer_output_shape = (number_of_patches, patch_size**2 * color_channels)

print(f"Input shape (single 2D image): {embedding_layer_input_shape}")
print(f"Output shape (single 1D sequence of patches): {embedding_layer_output_shape} -> (number_of_patches, embedding_dimension)")

### 4.2 Turning a single image into patches

Let's visualize, visualize, visualize!

In [None]:
# View a single image
plt.imshow(image.permute(1,2,0))
plt.title(class_names[label])
plt.axis(False);

In [None]:
image.shape

In [None]:
# Get the top row of the image
image_permuted = image.permute(1,2,0) # Convert image to color channels last (H,W,C)

# Index to plot the top row of pixels
patch_size = 16
plt.figure(figsize=(patch_size,patch_size))
plt.imshow(image_permuted[:patch_size, :, :])
plt.axis(False);

In [None]:
# Setup code to plot top row as patches
img_size = 224
patch_size = 16
num_patches = img_size // patch_size # integer number of patches per row
assert img_size % patch_size == 0, "Image size must be divisible by patch size"
print(f"Number of patches per row: {num_patches}\nPatch size: {patch_size}")

# Create a series of subplots
fig, axis = plt.subplots(nrows=1,
                        ncols=img_size // patch_size, # One column for each patch
                        sharex=True,
                        sharey=True,
                        figsize=(patch_size,patch_size))

# Iterate through number of patches in the top row

for i, patch in enumerate(range(0,img_size, patch_size)):
  axis[i].imshow(image_permuted[:patch_size,patch:patch+patch_size,:])
  axis[i].set_xlabel(i)
  axis[i].set_xticks([])
  axis[i].set_yticks([])

In [None]:
# Setup code to plot whole image as patches
img_size = 224
patch_size = 16
num_patches = img_size // patch_size # integer number of patches per row/column
assert img_size % patch_size == 0, "Image size must be divisible by patch size"
print(f"Number of patches per row: {num_patches}\
\nNumber of patches per column: {num_patches}\
\nTotal patches: {num_patches*num_patches}\
\nPatch size: {patch_size} pixels x {patch_size} pixels")

# Create a series of subplots
fig, axis = plt.subplots(nrows=img_size // patch_size,
                        ncols=img_size // patch_size, # One column for each patch
                        sharex=True,
                        sharey=True,
                        figsize=(num_patches,num_patches))

# Loop through height and width of image
for i, patch_height in enumerate(range(0,img_size, patch_size)):
  for j, patch_width in enumerate(range(0,img_size, patch_size)):
    # Plot the permuted image on the different axes
    axis[i,j].imshow(image_permuted[patch_height:patch_height+patch_size,
                                    patch_width:patch_width+patch_size, # iterate through width
                                    :]) # Get all color channels
    # Set up label information for each subplot (patch)
    axis[i, j].set_ylabel(i+1,
                          rotation="horizontal",
                          horizontalalignment="right",
                          verticalalignment="center")

    axis[i, j].set_xlabel(j+1)
    axis[i, j].set_xticks([])
    axis[i, j].set_yticks([])
    axis[i,j].label_outer()

fig.suptitle(f"{class_names[label]} -> Patchified",fontsize=14)
plt.show()

### 4.3 Creating image patches and turning them into patch embeddings

Perhaps we could create the image patches and image patch embeddings in a single step using `torch.nn.Conv2d()` and setting the kernel size and stride parameters to `patch_size`.

In [None]:
# Create conv2d layer to turn image into patches of learnable feature maps (embeddings)
from torch import nn

# Set the patch size
patch_size=16

# Create a conv2d layer with hyperparameters from the ViT paper
conv2d = nn.Conv2d(in_channels=3,
                   out_channels=768, # D size from table 1 (patch_size² * 3)
                   kernel_size=patch_size,
                   stride=patch_size,
                   padding=0)

In [None]:
# View a single image
plt.imshow(image.permute(1,2,0))
plt.title(class_names[label])
plt.axis(False);

In [None]:
# Pass the image through the convolutional layer
image_out_of_conv = conv2d(image.unsqueeze(dim=0)) # Add batch dimension
print(f"Shape of the output of the convolutional layer: {image_out_of_conv.shape}")

14*14 is the total number of patches (196)

Now that we've passed a single image through our `conv2d` layer, its shape is:

```python
torch.Size([1,768,14,14]) # (batch_size, embedding_dim, feature_map_height, feature_map_width)
```
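Why does a convolution count as "patchify, then linearly project"? The following is a small sketch under assumed toy sizes (a 16x16 image, patch size 4, embedding dimension 8, all chosen only to keep the check fast). It compares the `nn.Conv2d` route with explicitly cutting out patches via `torch.nn.functional.unfold` and multiplying by the same weights.

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(42)

# Small assumed sizes so the check runs quickly
C, P, D = 3, 4, 8
x = torch.randn(1, C, 16, 16)  # (batch, channels, height, width)

conv = nn.Conv2d(in_channels=C, out_channels=D, kernel_size=P, stride=P, padding=0)

# Route 1: convolution, then flatten the spatial dims and move them to the middle
via_conv = conv(x).flatten(start_dim=2).permute(0, 2, 1)          # (1, N, D)

# Route 2: cut out non-overlapping patches, flatten each, apply a linear map
patches = F.unfold(x, kernel_size=P, stride=P).permute(0, 2, 1)   # (1, N, C*P*P)
weight = conv.weight.reshape(D, C * P * P)                        # same weights, flattened
via_linear = patches @ weight.T + conv.bias                       # (1, N, D)

print(via_conv.shape, torch.allclose(via_conv, via_linear, atol=1e-5))
```

Because the kernel size equals the stride, each kernel position covers exactly one patch, so both routes should agree up to floating point error.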

In [None]:
# Plot random convolutional feature maps (embeddings)
import random
random_indexes = random.sample(range(0,768), k=5) # pick 5 of the 768 feature maps
print(f"Showing random convolutional feature maps from indexes: {random_indexes}")

# Create a plot
fig, axis = plt.subplots(nrows=1,ncols=5,figsize=(12,12))

# Plot random image feature maps
for i, idx in enumerate(random_indexes):
  image_conv_feature_map = image_out_of_conv[:,idx,:,:] # index on the output tensor of the conv2d layer
  axis[i].imshow(image_conv_feature_map.squeeze().detach().numpy()) # remove batch dimension and remove from grad tracking/switch to numpy for matplotlib
  axis[i].set(xticklabels=[],yticklabels=[],xticks=[],yticks=[])

In [None]:
# Get a single feature map in tensor form
single_feature_map = image_out_of_conv[:,0,:,:]
single_feature_map

### 4.4 Flattening the patch embedding with `torch.nn.Flatten()`

Right now we've got a series of convolutional feature maps (patch embeddings) that we want to flatten into a sequence of patch embeddings to satisfy the input criteria of the ViT Transformer Encoder.

In [None]:
print(f"{image_out_of_conv.shape} -> (batch_size, embedding_dim, feature_map_height, feature_map_width)")

Want: (batch_size, number_of_patches, embedding_dim)

In [None]:
from torch import nn
flatten_layer = nn.Flatten(start_dim=2,
                            end_dim=3)
flatten_layer(image_out_of_conv).shape

In [None]:
# Put everything together
plt.imshow(image.permute(1,2,0))
plt.title(class_names[label])
plt.axis(False)
print(f"Original image shape: {image.shape}")

#Turn image into feature maps
image_out_of_conv = conv2d(image.unsqueeze(dim=0)) # add batch dimension
print(f"Image feature map (patches) shape: {image_out_of_conv.shape}")

# Flatten the feature maps
image_out_of_conv_flattened = flatten_layer(image_out_of_conv)
print(f"Flattened image feature map shape: {image_out_of_conv_flattened.shape}")

In [None]:
# Rearrange output of flattened layer
image_out_of_conv_flattened_permuted = image_out_of_conv_flattened.permute(0,2,1)
print(f"{image_out_of_conv_flattened_permuted.shape} -> (batch_size, number_of_patches, embedding_dim)")

In [None]:
# Get a single flattened feature map
single_flattened_feature_map = image_out_of_conv_flattened_permuted[: ,: ,0]

# Plot the flattened feature map visually
plt.figure(figsize=(22,22))
plt.imshow(single_flattened_feature_map.detach().numpy())
plt.axis(False)
plt.title(f"Single flattened feature map shape: {single_flattened_feature_map.shape}");

### 4.5 Turning the ViT patch embedding layer into a PyTorch module

We want this module to do a few things:
1. Create a class called `PatchEmbedding` that inherits from `nn.Module`.
2. Initialize with appropriate hyperparameters, such as channels, embedding dimension, patch size.
3. Create a layer to turn an image into embedded patches using `nn.Conv2d()`.
4. Create a layer to flatten the feature maps of the output of the layer in 3.
5. Define a `forward()` that defines the forward computation (e.g. pass through layer from 3 and 4).
6. Make sure the output shape of the layer reflects the required output shape of the patch embedding.

In [None]:
# 1. Create a class called PatchEmbedding
class PatchEmbedding(nn.Module):
  # 2. Initialize the layer with appropriate hyperparameters
  def __init__(self,
               input_size: int=3,
               patch_size: int=16,
               embedding_dim: int=768):
    super().__init__()

    # 3. Create a layer to turn an image into embedded patches
    self.patcher = nn.Conv2d(in_channels=input_size,
                             out_channels=embedding_dim,
                             kernel_size=patch_size,
                             stride=patch_size,
                             padding=0)

    # 4. Create a layer to flatten feature map outputs of Conv2d
    self.flatten = nn.Flatten(start_dim=2,
                              end_dim=3)

    self.patch_size = patch_size

  # 5. Define a forward method to define the forward computation steps
  def forward(self,x):
    # Create assertion to check that inputs are the correct shape
    image_resolution = x.shape[-1]
    assert image_resolution % self.patch_size == 0, f"Input image size must be divisible by patch size, image shape: {image_resolution}, patch size: {self.patch_size}"

    # Perform the forward pass
    x =  self.flatten(self.patcher(x))

    # 6. Make sure the returned sequence embedding dimensions are in the right order (batch_size, number_of_patches, embedding dimension)
    return x.permute(0,2,1)

In [None]:
set_seeds()

# Create an instance of patch embedding layer
patchify = PatchEmbedding()

# Pass a single image through patch embedding layer
print(f"Input image size: {image.unsqueeze(0).shape}")
patch_embedded_image = patchify(image.unsqueeze(0))
print(f"Output patch embedded sequence shape: {patch_embedded_image.shape}")

### 4.6 Creating the class token embedding

Want to: prepend a learnable class token to the start of the patch embedding.

In [None]:
patch_embedded_image.shape

In [None]:
# Get the batch size and embedding dimension
batch_size = patch_embedded_image.shape[0]
embedding_dim = patch_embedded_image.shape[-1]
print(f"Batch size: {batch_size}, embedding dimension: {embedding_dim}")

In [None]:
# Create a class token embedding as a learnable parameter that shares the same size as the embedding dimension (D)
class_token = nn.Parameter(torch.ones(batch_size,1,embedding_dim), requires_grad=True)
class_token.shape

In [None]:
# Add the class token embedding to the front of the patch embedding
patch_embedded_image_with_class_embedding = torch.cat((class_token,patch_embedded_image),
                                                      dim=1) # Number_of_patches dimension

print(patch_embedded_image_with_class_embedding)
print(f"Sequence of patch embeddings with class token prepended shape: {patch_embedded_image_with_class_embedding.shape} -> (batch_size, class token + number_of_patches, embedding_dim)")

The first row of the sequence is now the class token. It would usually start as random values to be optimized, but in our example it starts as ones to make it easier to spot.
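The cell above ties the class token's shape to the current batch size. A common alternative, sketched here with assumed example sizes and random patch embeddings, is to store a single token of shape `(1, 1, embedding_dim)` and repeat it across the batch with `expand()`, so the same learnable values are shared by every image.

```python
import torch
from torch import nn

# Assumed sizes for illustration
batch_size, num_patches, embedding_dim = 4, 196, 768
patch_embeddings = torch.randn(batch_size, num_patches, embedding_dim)

# One learnable token, independent of the batch size
class_token = nn.Parameter(torch.randn(1, 1, embedding_dim), requires_grad=True)

# expand() creates a view repeated along the batch dimension without copying data
class_token_batch = class_token.expand(batch_size, -1, -1)

# Prepend the token along the sequence (number of patches) dimension
sequence = torch.cat((class_token_batch, patch_embeddings), dim=1)
print(sequence.shape)  # (batch_size, num_patches + 1, embedding_dim)
```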

### 4.7 Creating the position embedding

Want to: create a series of 1D learnable position embeddings and to add them to the sequence of patch embeddings.

In [None]:
# View the sequence of patch embeddings with the prepended class embedding
patch_embedded_image_with_class_embedding, patch_embedded_image_with_class_embedding.shape

In [None]:
# Calculate N (number of patches)
number_of_patches = int((height*width) / patch_size**2)

# Get the embedding dimension
embedding_dimension = patch_embedded_image_with_class_embedding.shape[-1]

# Create the learnable 1D position embedding
position_embedding = nn.Parameter(torch.ones(1,
                                             number_of_patches+1,
                                             embedding_dimension),
                                  requires_grad=True)
position_embedding,position_embedding.shape

In [None]:
# Add the position embedding to the patch and class token embedding
patch_and_position_embedding = patch_embedded_image_with_class_embedding + position_embedding
patch_and_position_embedding, patch_and_position_embedding.shape
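The position embedding has a leading dimension of 1, so the addition relies on broadcasting. The sketch below uses assumed sizes and random values to show that one set of position embeddings is added to every sequence in a batch.

```python
import torch
from torch import nn

# Assumed sizes: a batch of 2 sequences, 196 patches + 1 class token, D = 768
batch_size, seq_len, embedding_dim = 2, 197, 768
tokens = torch.randn(batch_size, seq_len, embedding_dim)

# A single set of position embeddings with a leading dimension of 1
position_embedding = nn.Parameter(torch.randn(1, seq_len, embedding_dim))

# Broadcasting adds the same position embedding to every item in the batch
tokens_with_position = tokens + position_embedding

print(tokens_with_position.shape)
```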

### 4.8 Putting it all together: from image to embedding

We've written code to turn an image into a flattened sequence of patch embeddings.

Now let's see it all in one cell.

In [None]:
set_seeds()

# 1. Set the patch size
patch_size = 16

# 2. Print shapes of the original image tensor and get the image dimensions
print(f"Image tensor shape: {image.shape}")
height, width = image.shape[1], image.shape[2] # image is (color_channels, height, width)

# 3. Get the image tensor and add the batch dimension
x = image.unsqueeze(0)
print(f"Input image shape: {x.shape}")

# 4. Create patch embedding layer
patch_embedding_layer = PatchEmbedding(input_size=3,
                                       patch_size=patch_size,
                                       embedding_dim=768)

# 5. Pass input image through PatchEmbedding
patch_embedding = patch_embedding_layer(x)
print(f"Patch embedding shape: {patch_embedding.shape}")

# 6. Get the class token embedding
batch_size = patch_embedding.shape[0]
embedding_dimension = patch_embedding.shape[-1]
class_token = nn.Parameter(torch.ones(batch_size,1,embedding_dimension),
                           requires_grad=True)
print(f"Class token embedding shape: {class_token.shape}")

# 7. Prepend the class token embedding to the patch embedding
patch_embedding_class_token = torch.cat((class_token, patch_embedding),dim=1)
print(f"Patch embedding with class token shape: {patch_embedding_class_token.shape}")

# 8. Create position embedding
number_of_patches = int((height*width) / patch_size**2)
position_embedding = nn.Parameter(torch.ones(1,
                                             number_of_patches+1,
                                             embedding_dimension),
                                  requires_grad=True)

# 9. Add the position embedding to patch embedding with class token
patch_embedding_class_token_position = patch_embedding_class_token + position_embedding
print(f"Patch embedding with class token and position embedding shape: {patch_embedding_class_token_position.shape}")

## 5. Equation 2: Multihead Self-Attention (MSA Block)

* **Multihead self-attention**: which parts of a sequence should pay the most attention to which other parts of the same sequence?
  * In our case, we have a series of embedded image patches: which patch significantly relates to another patch?
  * We want our neural network (ViT) to learn this relationship/representation.
* To replicate MSA in PyTorch we can use `torch.nn.MultiheadAttention()`: https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html
* **LayerNorm** = Layer normalization is a technique to normalize the distributions of intermediate layers. It enables smoother gradients, faster training, and better generalization accuracy.
  * Normalization = make everything have the same mean and same standard deviation.
  * In PyTorch = `torch.nn.LayerNorm()` (https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html) normalizes values over the $D$ dimension; in our case, the $D$ dimension is the embedding dimension.
    * When we normalize along the embedding dimension, it's like making all of the stairs in a staircase the same size.
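To see what "same mean and same standard deviation" looks like in practice, here is a small sketch. The input is an assumed random tensor that is deliberately shifted and scaled so the effect of `nn.LayerNorm` is visible.

```python
import torch
from torch import nn

torch.manual_seed(42)

# Assumed example: 1 image, 197 tokens, embedding dimension 768,
# shifted and scaled so the input is clearly not normalized
x = torch.randn(1, 197, 768) * 5 + 3

layer_norm = nn.LayerNorm(normalized_shape=768)
out = layer_norm(x)

# Each token (last dimension) is normalized independently
print(f"Before: mean={x[0, 0].mean():.3f}, std={x[0, 0].std():.3f}")
print(f"After:  mean={out[0, 0].mean():.3f}, std={out[0, 0].std():.3f}")
```

After the layer, every token's embedding vector should have a mean close to 0 and a standard deviation close to 1, at least until the layer's learnable scale and shift are trained away from their initial values.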

In [None]:
class MultiHeadSelfAttentionBlock(nn.Module):
  """Creates a multi-head self-attention block ("MSA Block").
  """
  def __init__(self,
               embedding_dim: int=768, # Hidden size D (embedding dimension) from Table 1
               num_heads: int=12, # Heads from Table 1
               attn_dropout: float=0):
    super().__init__()

    # Create the norm layer (LN)
    self.norm_layer = nn.LayerNorm(normalized_shape=embedding_dim)

    # Create multihead attention (MSA) layer
    self.multihead_layer = nn.MultiheadAttention(embed_dim=embedding_dim,
                                                 dropout=attn_dropout,
                                                 num_heads=num_heads,
                                                 batch_first=True) # We're using (batches, num_of_patches, embedding_dim)
                                                                   # In pytorch (batch, seq, feature)
  def forward(self,x):
    x = self.norm_layer(x)
    attn_output, _ = self.multihead_layer(query=x,
                                          key=x,
                                          value=x,
                                          need_weights=False)
    return attn_output

In [None]:
# Create an instance of the MSA block
multi_head_self_attention_block = MultiHeadSelfAttentionBlock(embedding_dim=embedding_dim,
                                                              num_heads=12,
                                                              attn_dropout=0)

# Pass the patch and position image embedding sequence through MSA block
patched_image_through_msa_block = multi_head_self_attention_block(patch_and_position_embedding)
print(f"Input shape of MSA block: {patch_and_position_embedding.shape}")
print(f"Output shape of MSA block: {patched_image_through_msa_block.shape}")
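What happens inside a single attention head? The sketch below computes scaled dot-product attention by hand on a tiny assumed example. The projection matrices `w_q`, `w_k` and `w_v` are hypothetical random stand-ins for learned weights. `nn.MultiheadAttention` runs several such heads in parallel, each on its own learned projections, and applies a final output projection.

```python
import math
import torch

torch.manual_seed(42)

# Assumed tiny example: 1 image, 5 tokens, 8-dimensional embeddings
x = torch.randn(1, 5, 8)

# Hypothetical projection matrices (these are learned in a real model)
w_q, w_k, w_v = (torch.randn(8, 8) for _ in range(3))
q, k, v = x @ w_q, x @ w_k, x @ w_v

# Similarity of every token to every other token, scaled by sqrt(d_k)
scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])  # (1, 5, 5)
weights = torch.softmax(scores, dim=-1)                    # each row sums to 1

# Each output token is a weighted mix of the value vectors
out = weights @ v                                          # (1, 5, 8)
print(weights.shape, out.shape)
```

Row `i` of `weights` describes how much token `i` attends to each token in the sequence, which is the "which patch relates to which patch" relationship described above.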

In [None]:
patch_and_position_embedding,patched_image_through_msa_block

## 6. Equation 3: Multilayer Perceptron (MLP Block)

* **MLP** = The MLP contains two layers with a GELU non-linearity (section 3.1).
  * MLP = quite a broad term for a block with a series of layer(s). There can be multiple hidden layers or only one.
  * Layers can mean: fully-connected, dense, linear, feed-forward, all are often similar names for the same thing. In PyTorch, they're often called `torch.nn.Linear()`.
  * MLP number of hidden units = MLP Size in Table 1.
* **Dropout** = Dropout, when used, is applied after every dense (linear) layer except for the qkv-projections and directly after adding positional- to patch embeddings. Hybrid models are trained with the exact setup as their ViT counterparts.
  * Value for Dropout available in Table 3

  In pseudocode:
  ```python
  #MLP
  x = linear -> non-linear -> dropout -> linear -> dropout
  ```
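The GELU non-linearity can be written as $\text{GELU}(x) = x \cdot \Phi(x)$, where $\Phi$ is the standard normal cumulative distribution function. As a small sketch on a few assumed input values, the formula can be compared against `nn.GELU()` with its default settings:

```python
import math
import torch
from torch import nn

# A few assumed input values spanning negative and positive numbers
x = torch.linspace(-3, 3, steps=7)

# GELU(x) = x * Phi(x), where Phi is the standard normal CDF written via erf
manual_gelu = 0.5 * x * (1 + torch.erf(x / math.sqrt(2)))

print(manual_gelu)
print(torch.allclose(manual_gelu, nn.GELU()(x), atol=1e-6))
```

Unlike ReLU, GELU lets small negative values through (scaled down) rather than setting them to exactly zero.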

In [None]:
class MLPBlock(nn.Module):
  def __init__(self,
               embedding_dim: int=768,
               mlp_size: int=3072,
               dropout: float=0.1):
    super().__init__()

    # Create the norm layer (LN)
    self.layer_norm = nn.LayerNorm(normalized_shape=embedding_dim)

    # Create the MLP
    self.mlp = nn.Sequential(
        nn.Linear(in_features=embedding_dim,
                  out_features=mlp_size),
        nn.GELU(),
        nn.Dropout(p=dropout),
        nn.Linear(in_features=mlp_size,
                  out_features=embedding_dim),
        nn.Dropout(p=dropout)
    )
  def forward(self,x):
    return self.mlp(self.layer_norm(x))


In [None]:
# Create an instance of MLPBlock
mlp_block = MLPBlock(embedding_dim=embedding_dim,
                     mlp_size=3072,
                     dropout=0.1)
# Pass the output of the MSABlock through the MLPBlock
patched_image_through_mlp_block = mlp_block(patched_image_through_msa_block)
print(f"Input shape of MLP block: {patched_image_through_msa_block.shape}")
print(f"Output shape of MLP block: {patched_image_through_mlp_block.shape}")

In [None]:
patched_image_through_msa_block,patched_image_through_mlp_block

## 7. Creating the Transformer Encoder

The Transformer Encoder is a combination of alternating blocks of MSA and MLP.

And there are residual connections between each block.

* Encoder = turn a sequence into a learnable representation.
* Decoder = go from a learned representation back to some sort of sequence.
* Residual connections = add a layer's (or layers') input to its subsequent output. This enables the creation of deeper networks (it helps prevent gradients from becoming too small as they flow back through many layers).

In pseudocode:
```python
# Transformer Encoder
x_input -> MSA_block -> [MSA_block_output + x_input] -> MLP_block -> [MLP_block_output + MSA_block_output + x_input] -> ...
```

### 7.1 Create a custom Transformer Encoder Block

In [None]:
class TransformerEncoder(nn.Module):
  def __init__(self,
               embedding_dim: int=768,
               num_heads: int=12,
               mlp_size: int=3072,
               mlp_dropout: float=0.1,
               attn_dropout: float=0,
  ):
    super().__init__()

    # Create MSA block (equation 2)
    self.msa_block = MultiHeadSelfAttentionBlock(embedding_dim=embedding_dim,
                                                 num_heads=num_heads,
                                                 attn_dropout=attn_dropout)

    # Create MLP block (equation 3)
    self.mlp_block = MLPBlock(embedding_dim=embedding_dim,
                              mlp_size=mlp_size,
                              dropout=mlp_dropout)

  def forward(self,x):
    x = x + self.msa_block(x) # residual/skip connection for equation2
    x = x + self.mlp_block(x) # residual/skip connection for equation3
    return x

In [None]:
# Create an instance of TransformerEncoder()
transformer_encoder_block = TransformerEncoder()

# Get a summary with torchinfo
import torchinfo
torchinfo.summary(model=transformer_encoder_block,
                  input_size=(1,197,768),
                  col_names=["input_size","output_size","num_params","trainable"],
                  col_width=20,
                  row_settings=["var_names"])
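The parameter counts reported by `torchinfo` can be cross-checked by hand. The following sketch assumes the ViT-Base sizes used above, bias terms on every linear and projection layer, and learnable scale and shift parameters in each LayerNorm (the PyTorch defaults). The helper `linear_params` is a hypothetical name introduced only for this illustration.

```python
# Assumed ViT-Base hyperparameters from Table 1
embedding_dim, mlp_size = 768, 3072

def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

layer_norm = 2 * embedding_dim                                     # scale and shift
qkv_projection = linear_params(embedding_dim, 3 * embedding_dim)   # query, key, value
attn_out_projection = linear_params(embedding_dim, embedding_dim)
mlp = linear_params(embedding_dim, mlp_size) + linear_params(mlp_size, embedding_dim)

msa_block = layer_norm + qkv_projection + attn_out_projection
mlp_block = layer_norm + mlp
print(f"MSA block: {msa_block:,} | MLP block: {mlp_block:,} | total: {msa_block + mlp_block:,}")
```

Under these assumptions the MLP block holds roughly twice as many parameters as the MSA block, since its hidden layer is four times wider than the embedding dimension.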

### 7.2 Create a Transformer Encoder layer with in-built PyTorch layers

So far we've created a transformer encoder by hand.

But because of how well the Transformer architecture works, PyTorch has implemented ready-to-use Transformer Encoder layers: https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html

We can create a Transformer Encoder with pure PyTorch layers.

In [None]:
# Create the same as above with torch.nn.TransformerEncoderLayer()
transformer_encoder_layer = nn.TransformerEncoderLayer(d_model=768, # embedding_dim
                                                       nhead=12, # heads from Table 1
                                                       dim_feedforward=3072, # MLP size from Table 1
                                                       dropout=0.1,
                                                       activation="gelu",
                                                       batch_first=True,
                                                       norm_first=True)
transformer_encoder_layer

In [None]:
torchinfo.summary(model=transformer_encoder_layer,
                  input_size=(1,197,768),
                  col_names=["input_size","output_size","num_params","trainable"],
                  col_width=20,
                  row_settings=["var_names"])

## 8. Putting it all together to create ViT

In [None]:
# Create a ViT class
class ViT(nn.Module):
  def __init__(self,
               img_size: int=224, # Table 3 from ViT paper
               in_channels: int=3,
               patch_size: int=16,
               num_transformer_layers: int=12, # Table 1 for "Layers" for ViT-Base
               embedding_dim: int=768, # Hidden size D from Table 1 for Vit-Base
               num_heads: int=12,
               mlp_size: int=3072,
               mlp_dropout: float=0.1,
               attn_dropout: float=0,
               embedding_dropout: float=0.1,
               num_classes: int=1000
               ):
    super().__init__()

    # Make an assertion that the image size is compatible with the patch size
    assert img_size % patch_size == 0, f"Image size must be divisible by patch size, image size: {img_size}, patch size: {patch_size}"

    # Calculate the number of patches (height * width/patch²)
    self.num_patches = (img_size*img_size) // patch_size**2

    # Create learnable class embedding (needs to go at the front of the sequence)
    self.class_embedding = nn.Parameter(torch.randn(1,1,embedding_dim),
                                        requires_grad=True)

    # Create learnable position embedding
    self.position_embedding = nn.Parameter(data=torch.randn(1,self.num_patches+1, embedding_dim),
                                           requires_grad=True)

    # Create embedding dropout value
    self.embedding_dropout = nn.Dropout(p=embedding_dropout)

    # Create patch embedding layer
    self.patch_embedding = PatchEmbedding(input_size=in_channels,
                                          patch_size=patch_size,
                                          embedding_dim=embedding_dim)

    # Create the Transformer Encoder Block
    self.transformer_encoder = nn.Sequential(*[TransformerEncoder(embedding_dim=embedding_dim,
                                                                  num_heads=num_heads,
                                                                  mlp_size=mlp_size,
                                                                  mlp_dropout=mlp_dropout) for _ in range(num_transformer_layers)])

    # Create classifier head
    self.classifier = nn.Sequential(
        nn.LayerNorm(normalized_shape=embedding_dim),
        nn.Linear(in_features=embedding_dim,
                  out_features=num_classes)
    )

  def forward(self,x):
    # Get the batch size
    batch_size = x.shape[0]

    # Create class token embedding and expand it to match the batch size
    class_token = self.class_embedding.expand(batch_size,-1,-1) # "-1" means to infer the dimensions

    # Create the patch embedding (equation 1)
    x = self.patch_embedding(x)

    # Concat class token embedding and patch embedding (equation 1)
    x = torch.cat((class_token,x),dim=1)

    # Add position embedding to class token and patch embedding
    x = x + self.position_embedding

    # Apply dropout to the embeddings ("directly after adding positional- to patch embeddings")
    x = self.embedding_dropout(x)

    # Pass position and patch embedding to Transformer Encoder (equation 2 & 3)
    x = self.transformer_encoder(x)

    # Put 0th index logit through classifier (equation 4)
    return self.classifier(x[:,0])



In [None]:
vit = ViT()
vit

In [None]:
set_seeds()

# Create a random image tensor with same shape as a single image
random_image_tensor = torch.randn(1,3,224,224)

# Create an instance of ViT with the number of classes we're working with
vit = ViT(num_classes=len(class_names))

# Pass the random image tensor to our ViT instance
vit(random_image_tensor)

### 8.1 Getting a visual summary of our ViT model

In [None]:
torchinfo.summary(model=ViT(num_classes=3),
                  input_size=(1,3,224,224),
                  col_names=["input_size","output_size","num_params","trainable"],
                  col_width=20,
                  row_settings=["var_names"])
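To make the shapes in the `forward()` method easier to follow, here is a minimal sketch that traces a single random image through the embedding steps (patching, flattening, class token, position embedding). It assumes the ViT-Base defaults used above (224x224 images, 16x16 patches, embedding dimension 768) and uses a standalone `nn.Conv2d` as a stand-in for the patch embedding layer, so treat it as an illustration rather than the exact layers inside our `ViT` class.

```python
import torch
from torch import nn

# Assumed ViT-Base settings, matching the defaults of the ViT class above
img_size, patch_size, embedding_dim = 224, 16, 768

image = torch.randn(1, 3, img_size, img_size)  # [batch, channels, height, width]

# A conv layer with kernel_size == stride == patch_size cuts the image into patches
patcher = nn.Conv2d(in_channels=3, out_channels=embedding_dim,
                    kernel_size=patch_size, stride=patch_size)
patches = patcher(image)                                   # [1, 768, 14, 14]
sequence = patches.flatten(start_dim=2).permute(0, 2, 1)   # [1, 196, 768]

# Prepend one class token, then add a position embedding of the same shape
class_token = torch.randn(1, 1, embedding_dim)
sequence = torch.cat((class_token, sequence), dim=1)       # [1, 197, 768]
position_embedding = torch.randn(1, sequence.shape[1], embedding_dim)
sequence = sequence + position_embedding

num_patches = (img_size * img_size) // patch_size**2
print(num_patches, tuple(sequence.shape))  # 196 (1, 197, 768)
```

The sequence length of 197 (196 patches + 1 class token) is the same `input_size=(1,197,768)` we passed to `torchinfo.summary()` for a single Transformer Encoder layer.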

## 9. Setting up training code for our custom ViT

We've replicated the ViT architecture, now let's see how it performs on our FoodVision Mini data.

### 9.1 Creating an optimizer

The paper states that it uses the Adam optimizer (section 4, Training & fine-tuning) with $\beta_1 = 0.9$, $\beta_2 = 0.999$ (the PyTorch defaults) and a weight decay of 0.1.

Weight decay = a regularization technique that adds a small penalty, usually the L2 norm of the weights (all the weights of the model), to the loss function.

Regularization technique = a method that helps prevent overfitting.

In [None]:
optimizer = torch.optim.Adam(params=vit.parameters(),
                             lr=0.001,
                             weight_decay=0.1,
                             betas = (0.9,0.999))
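To see what the `weight_decay` argument does, here is a small sketch on a toy problem (the loss, tensor sizes and number of steps are made up for illustration). It compares `torch.optim.Adam` with `weight_decay=0.1` against plain Adam where the L2 penalty $\frac{\lambda}{2}\lVert w \rVert^2$ is added to the loss by hand. Both runs should end up with the same weights, because `torch.optim.Adam` implements weight decay by adding `weight_decay * w` to the gradient.

```python
import torch

torch.manual_seed(42)
w_init = torch.randn(5)
x = torch.randn(5)
weight_decay = 0.1

def run_adam(use_optimizer_decay: bool, steps: int = 5) -> torch.Tensor:
    """Run a few Adam steps on a toy loss and return the final weights."""
    w = w_init.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([w], lr=0.001,
                                 weight_decay=weight_decay if use_optimizer_decay else 0.0)
    for _ in range(steps):
        loss = ((w * x).sum() - 1.0) ** 2  # toy loss, stands in for a real training loss
        if not use_optimizer_decay:
            # Explicit L2 penalty: its gradient is weight_decay * w
            loss = loss + 0.5 * weight_decay * (w ** 2).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return w.detach()

w_decay = run_adam(use_optimizer_decay=True)
w_penalty = run_adam(use_optimizer_decay=False)
print(torch.allclose(w_decay, w_penalty, atol=1e-5))
```

Note this equivalence is specific to `torch.optim.Adam`; `torch.optim.AdamW` applies the decay to the weights directly instead of going through the gradient.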

### 9.2 Creating a loss function

The ViT paper doesn't actually mention what loss function they used.

Since it's a multi-class classification problem, we'll use `torch.nn.CrossEntropyLoss()`.

In [None]:
loss_fn = nn.CrossEntropyLoss()
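As a quick reminder of what this loss expects, the sketch below feeds `nn.CrossEntropyLoss()` some made-up logits (raw model outputs, no softmax applied) and integer class labels, then reproduces the same value by hand as the mean negative log-probability of the true class. The tensors and variable names are hypothetical.

```python
import torch
from torch import nn

# Hypothetical raw model outputs (logits) for 2 images and 3 classes
example_logits = torch.tensor([[2.0, 0.5, -1.0],
                               [0.1, 0.2, 3.0]])
example_labels = torch.tensor([0, 2])  # integer class indices, not one-hot vectors

example_loss_fn = nn.CrossEntropyLoss()
example_loss = example_loss_fn(example_logits, example_labels)

# Equivalent manual computation: mean negative log-probability of the true class
log_probs = torch.log_softmax(example_logits, dim=1)
manual_loss = -log_probs[torch.arange(len(example_labels)), example_labels].mean()

print(torch.allclose(example_loss, manual_loss))  # True
```

This is why our `ViT` class returns raw logits from the classifier head rather than applying a softmax itself.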

### 9.3 Training our ViT Model

In [None]:
from going_modular.going_modular import engine

results = engine.train(model=vit,
                       train_dataloader=train_dataloader,
                       test_dataloader=test_dataloader,
                       optimizer=optimizer,
                       loss_fn=loss_fn,
                       epochs=10,
                       device=device)

### 9.4 What our training setup is missing

How is our training setup different to the ViT paper?

We've replicated the model architecture correctly.

But what was different between our training procedure (which got such poor results) and the ViT paper's training procedure (which got such great results)?

The main things our training implementation is missing:

Prevent underfitting:
* Data - our setup uses far less data (225 images vs millions)

Prevent overfitting:
* Learning rate warmup - start with a low learning rate and increase to a base LR
* Learning rate decay - as our model gets closer to convergence, start to lower the learning rate
* Gradient clipping - prevent gradients from getting too big
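
As a rough idea of what learning rate warmup and decay could look like in PyTorch, here is a minimal sketch using `torch.optim.lr_scheduler.LambdaLR`. The tiny model, the step counts and the choice of a linear warmup followed by a linear decay are all assumptions for illustration; this is not the exact schedule from the paper and it is not wired into our `engine.train()` function.

```python
import torch
from torch import nn

toy_model = nn.Linear(4, 3)  # tiny stand-in model
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=0.001)

warmup_steps, total_steps = 5, 20  # assumed values for illustration

def lr_lambda(step: int) -> float:
    """Scale factor on the base LR: linear warmup, then linear decay towards zero."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(toy_optimizer, lr_lambda=lr_lambda)

lrs = []
for step in range(total_steps):
    lrs.append(toy_optimizer.param_groups[0]["lr"])  # LR used for this step
    toy_optimizer.step()  # in real training this comes after loss.backward()
    scheduler.step()

print([round(lr, 5) for lr in lrs])
```

The printed learning rates climb from 0.0002 up to the base value of 0.001 over the first five steps, then shrink again over the remaining steps.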

### 9.5 Plotting loss curves for our model

In [None]:
from helper_functions import plot_loss_curves
plot_loss_curves(results)

## 10. Using a pretrained ViT from `torchvision.models`

Generally, in deep learning, if you can use a model pretrained on a large dataset for your own problem, it's often a good place to start.

If you can find a pretrained model and use transfer learning, give it a go; it often achieves great results with little data.

### 10.1 Why use a pretrained model?
* Sometimes data is limited
* Limited training resources
* Get better results faster (sometimes) ...

### 10.2 Prepare a pretrained ViT for use with FoodVision Mini (turn it into a feature extractor)

In [None]:
from torchvision import models
weights = models.ViT_B_16_Weights.DEFAULT
model = models.vit_b_16(weights=weights).to(device)

In [None]:
for param in model.parameters():
  param.requires_grad = False
model.heads = nn.Sequential(
    nn.Linear(in_features=768,
              out_features=len(class_names)).to(device)
)

In [None]:
torchinfo.summary(model=model)
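To check that freezing worked, one option is to count trainable and frozen parameters directly. The sketch below does this on a tiny stand-in model (two hypothetical `nn.Linear` layers playing the role of backbone and head) so it runs without downloading any weights; the same two `sum(...)` lines should work on the real `model` above.

```python
import torch
from torch import nn

# Tiny stand-in for a pretrained backbone plus classifier head (hypothetical sizes)
toy_feature_extractor = nn.Sequential(
    nn.Linear(768, 768),   # pretend this is the pretrained backbone
    nn.Linear(768, 1000),  # pretend this is the original 1000-class head
)

# Freeze everything, then swap the head for a new layer (new layers are trainable by default)
for param in toy_feature_extractor.parameters():
    param.requires_grad = False
toy_feature_extractor[1] = nn.Linear(in_features=768, out_features=3)

trainable = sum(p.numel() for p in toy_feature_extractor.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in toy_feature_extractor.parameters() if not p.requires_grad)
print(f"Trainable: {trainable:,} | Frozen: {frozen:,}")
```

Only the new head (768 x 3 weights + 3 biases) is trainable, which is the whole idea of a feature extractor.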

### 10.3 Preparing data for the pretrained ViT model

When using a pretrained model, you want to make sure your data is in the same format as the data the model was trained on.

In [None]:
# Get automatic transforms from pretrained ViT weights
vit_transforms = weights.transforms()
vit_transforms

In [None]:
# Setup dataloader

train_dataloader_pretrained, test_dataloader_pretrained, class_names = data_setup.create_dataloaders(train_dir= train_dir,
                                                                                                     test_dir= test_dir,
                                                                                                     transform= vit_transforms,
                                                                                                     batch_size=32)

In [None]:
optimizer = torch.optim.Adam(params=model.parameters(),
                             lr=0.001)
results = engine.train(model=model,
            train_dataloader=train_dataloader_pretrained,
            test_dataloader=test_dataloader_pretrained,
            optimizer=optimizer,
            loss_fn=loss_fn,
            epochs=10,
            device=device)

### 10.4 Plot the loss curves of our pretrained ViT feature extractor model

In [None]:
plot_loss_curves(results)

### 10.5 Save our best performing ViT model

Now we've got a model that performs quite well, how about we save it to file and check its filesize.

We want to check the filesize because if we wanted to deploy a model to say a website/mobile application, we may have limitations on the size of the model we can deploy.

E.g. a smaller model may be required due to compute restrictions.

In [None]:
from going_modular.going_modular import utils

utils.save_model(model=model,
                 target_dir="models",
                 model_name="08_pretrained_vit_feature_exctractor_pizza_steak_sushi.pth")

In [None]:
from pathlib import Path

# Get the model size in bytes then convert to megabytes
pretrained_vit_model_size = Path("models/08_pretrained_vit_feature_exctractor_pizza_steak_sushi.pth").stat().st_size // (1024*1024)

print(f"Pretrained ViT feature extractor model size: {pretrained_vit_model_size}MB")

Our pretrained ViT gets some of the best results we've seen so far on our FoodVision Mini problem. However, the model size is ~11x larger than our next best performing model.

Perhaps the larger model size might cause issues when we go to deploy it (e.g. hard to deploy such a large file/might not make predictions as fast as a smaller model).
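
If you want a rough size estimate before saving anything to disk, you can add up the bytes used by a model's parameters and buffers. The helper below, `estimate_model_size_mb`, is a hypothetical function written for this sketch (it is not part of `going_modular`), and the saved file will usually be slightly larger because of serialization overhead.

```python
import torch
from torch import nn

def estimate_model_size_mb(model: nn.Module) -> float:
    """Estimate the in-memory size of a model's parameters and buffers in megabytes."""
    num_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    num_bytes += sum(b.numel() * b.element_size() for b in model.buffers())
    return num_bytes / (1024 * 1024)

# Stand-in layer with the same shape as one ViT-Base MLP projection (768 -> 3072)
toy_layer = nn.Linear(in_features=768, out_features=3072)
toy_layer_size = estimate_model_size_mb(toy_layer)
print(f"{toy_layer_size:.2f} MB")  # 9.01 MB
```

With float32 weights each parameter takes 4 bytes, so parameter count is a good first guide to how large a deployed model will be.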

## 11. Predicting on a custom image

In [None]:
from going_modular.going_modular import predictions
predictions.pred_and_plot_image(model=model,class_names=class_names,image_path="imagem_test.png")

## Exercises


In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

In [None]:
image_path = download_data("https://github.com/mrdbourke/pytorch-deep-learning/raw/main/data/pizza_steak_sushi.zip",
                           destination="pizza_steak_sushi")
image_path

In [None]:
train_dir = image_path / "train"
test_dir = image_path / "test"

In [None]:
import torchvision
from torchvision import transforms
IMG_SIZE = 224

manual_transforms = transforms.Compose([
    transforms.Resize((IMG_SIZE,IMG_SIZE)),
    transforms.ToTensor()
])
manual_transforms

In [None]:
BATCH_SIZE = 32

train_dataloader, test_dataloader, class_names = data_setup.create_dataloaders(
    train_dir = train_dir,
    test_dir = test_dir,
    transform = manual_transforms,
    batch_size = BATCH_SIZE,
)
train_dataloader, test_dataloader, class_names

In [None]:
from torch import nn
class PatchEmbedding(nn.Module):
    """Turns a 2D input image into a 1D sequence learnable embedding vector.

    Args:
        in_channels (int): Number of color channels for the input images. Defaults to 3.
        patch_size (int): Size of patches to convert input image into. Defaults to 16.
        embedding_dim (int): Size of embedding to turn image into. Defaults to 768.
    """
    # 2. Initialize the class with appropriate variables
    def __init__(self,
                 in_channels:int=3,
                 patch_size:int=16,
                 embedding_dim:int=768):
        super().__init__()

        self.patch_size = patch_size

        # 3. Create a layer to turn an image into patches
        self.patcher = nn.Conv2d(in_channels=in_channels,
                                 out_channels=embedding_dim,
                                 kernel_size=patch_size,
                                 stride=patch_size,
                                 padding=0)

        # 4. Create a layer to flatten the patch feature maps into a single dimension
        self.flatten = nn.Flatten(start_dim=2, # only flatten the feature map dimensions into a single vector
                                  end_dim=3)

    # 5. Define the forward method
    def forward(self, x):
        # Create assertion to check that inputs are the correct shape
        image_resolution = x.shape[-1]
        assert image_resolution % self.patch_size == 0, f"Input image size must be divisible by patch size, image shape: {image_resolution}, patch size: {self.patch_size}"

        # Perform the forward pass
        x_patched = self.patcher(x)
        x_flattened = self.flatten(x_patched)
        # 6. Make sure the output shape has the right order
        return x_flattened.permute(0, 2, 1) # adjust so the embedding is on the final dimension [batch_size, P^2•C, N] -> [batch_size, N, P^2•C]

In [None]:
transformer_encoder_layer = nn.TransformerEncoderLayer(d_model=768,
                                                       nhead=12,
                                                       dim_feedforward=3072,
                                                       dropout=0.1,
                                                       activation="gelu",
                                                       batch_first=True,
                                                       norm_first=True) # LayerNorm before attention/MLP, as in the ViT class below
transformer_encoder_layer

In [None]:
transformer_encoder = nn.TransformerEncoder(encoder_layer = transformer_encoder_layer,
                                            num_layers=12)
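
A quick way to check a stacked encoder is to pass a dummy sequence through it and confirm the output shape matches the input shape. The sketch below uses deliberately small, assumed sizes (embedding dimension 64, 4 heads, 2 layers) so it runs fast; ViT-Base would use 768, 12 heads and 12 layers as above.

```python
import torch
from torch import nn

# Small assumed sizes so the example runs quickly (ViT-Base uses 768 / 12 / 3072 / 12)
small_layer = nn.TransformerEncoderLayer(d_model=64,
                                         nhead=4,
                                         dim_feedforward=256,
                                         dropout=0.1,
                                         activation="gelu",
                                         batch_first=True,
                                         norm_first=True)
small_encoder = nn.TransformerEncoder(encoder_layer=small_layer, num_layers=2)

dummy_tokens = torch.randn(1, 197, 64)  # [batch, sequence_length, embedding_dim]

small_encoder.eval()  # turn off dropout for the check
with torch.inference_mode():
    encoded = small_encoder(dummy_tokens)

print(tuple(encoded.shape))  # (1, 197, 64)
```

Because `batch_first=True`, the encoder expects the batch dimension first, which matches the output of our `PatchEmbedding` layer.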

In [None]:
class ViT(nn.Module):
  def __init__(self,
               img_size: int=224,
               num_channels: int=3,
               patch_size: int=16,
               embedding_dim=768,
               dropout=0.1,
               mlp_size=3072,
               num_transformer_layer=12,
               num_heads=12,
               num_classes=1000):

    super().__init__()

    assert img_size % patch_size == 0, f"Image size must be divisible by patch size, image size: {img_size}, patch size: {patch_size}"

    # 1. Create patch embedding
    self.patch_embedding = PatchEmbedding(in_channels=num_channels,
                                          patch_size=patch_size,
                                          embedding_dim=embedding_dim)

    # 2. Create class token
    self.class_token = nn.Parameter(torch.randn(1,1,embedding_dim),
                                    requires_grad=True)

    # 3. Create positional embedding
    num_patches = (img_size*img_size) // patch_size**2
    self.positional_embedding = nn.Parameter(torch.rand(1,num_patches+1,embedding_dim))

    # 4. Create patch + position embedding dropout
    self.embedding_dropout = nn.Dropout(p=dropout)

    # 5. Create Transformer Encoder layer (single)
    #self.transformer_encoder_layer = nn.TransformerEncoderLayer(d_model=embedding_dim,
    #                                                            nhead=num_heads,
    #                                                            dim_feedforward=mlp_size,
    #                                                            activation="gelu",
    #                                                            batch_first=True,
    #                                                            norm_first=True)

    # 6. Create stack Transformer Encoder layers
    self.transformer_encoder = nn.TransformerEncoder(encoder_layer=nn.TransformerEncoderLayer(d_model=embedding_dim,
                                                                                              nhead=num_heads,
                                                                                              dim_feedforward=mlp_size,
                                                                                              activation="gelu",
                                                                                              batch_first=True,
                                                                                              norm_first=True),
                                                    num_layers=num_transformer_layer)

    # 7. Create MLP head
    self.mlp_head = nn.Sequential(
        nn.LayerNorm(normalized_shape=embedding_dim),
        nn.Linear(in_features=embedding_dim,
                  out_features=num_classes)
    )

  def forward(self, x):
    batch_size = x.shape[0]

    x = self.patch_embedding(x)

    class_token = self.class_token.expand(batch_size, -1, -1) # "-1" means to infer the dimension

    x = torch.cat((class_token,x),dim=1)

    # Add the positional embedding to the patch embedding with class token
    x = self.positional_embedding + x

    x = self.embedding_dropout(x)

    x = self.transformer_encoder(x)

    return self.mlp_head(x[:,0])


In [None]:
import torchinfo
torchinfo.summary(ViT(num_classes=3),
                  input_size=(1,3,224,224))

### 2. Turn the custom ViT architecture we created into a Python script, for example, `vit.py`

In [None]:
%%writefile vit.py
import torch
from torch import nn

class PatchEmbedding(nn.Module):
    """Turns a 2D input image into a 1D sequence learnable embedding vector.

    Args:
        in_channels (int): Number of color channels for the input images. Defaults to 3.
        patch_size (int): Size of patches to convert input image into. Defaults to 16.
        embedding_dim (int): Size of embedding to turn image into. Defaults to 768.
    """
    # 2. Initialize the class with appropriate variables
    def __init__(self,
                 in_channels:int=3,
                 patch_size:int=16,
                 embedding_dim:int=768):
        super().__init__()

        self.patch_size = patch_size

        # 3. Create a layer to turn an image into patches
        self.patcher = nn.Conv2d(in_channels=in_channels,
                                 out_channels=embedding_dim,
                                 kernel_size=patch_size,
                                 stride=patch_size,
                                 padding=0)

        # 4. Create a layer to flatten the patch feature maps into a single dimension
        self.flatten = nn.Flatten(start_dim=2, # only flatten the feature map dimensions into a single vector
                                  end_dim=3)

    # 5. Define the forward method
    def forward(self, x):
        # Create assertion to check that inputs are the correct shape
        image_resolution = x.shape[-1]
        assert image_resolution % self.patch_size == 0, f"Input image size must be divisible by patch size, image shape: {image_resolution}, patch size: {self.patch_size}"

        # Perform the forward pass
        x_patched = self.patcher(x)
        x_flattened = self.flatten(x_patched)
        # 6. Make sure the output shape has the right order
        return x_flattened.permute(0, 2, 1) # adjust so the embedding is on the final dimension [batch_size, P^2•C, N] -> [batch_size, N, P^2•C]

class ViT(nn.Module):
  def __init__(self,
               img_size: int=224,
               num_channels: int=3,
               patch_size: int=16,
               embedding_dim=768,
               dropout=0.1,
               mlp_size=3072,
               num_transformer_layer=12,
               num_heads=12,
               num_classes=1000):

    super().__init__()

    assert img_size % patch_size == 0, f"Image size must be divisible by patch size, image size: {img_size}, patch size: {patch_size}"

    # 1. Create patch embedding
    self.patch_embedding = PatchEmbedding(in_channels=num_channels,
                                          patch_size=patch_size,
                                          embedding_dim=embedding_dim)

    # 2. Create class token
    self.class_token = nn.Parameter(torch.randn(1,1,embedding_dim),
                                    requires_grad=True)

    # 3. Create positional embedding
    num_patches = (img_size*img_size) // patch_size**2
    self.positional_embedding = nn.Parameter(torch.rand(1,num_patches+1,embedding_dim))

    # 4. Create patch + position embedding dropout
    self.embedding_dropout = nn.Dropout(p=dropout)

    # 5. Create Transformer Encoder layer (single)
    #self.transformer_encoder_layer = nn.TransformerEncoderLayer(d_model=embedding_dim,
    #                                                            nhead=num_heads,
    #                                                            dim_feedforward=mlp_size,
    #                                                            activation="gelu",
    #                                                            batch_first=True,
    #                                                            norm_first=True)

    # 6. Create stack Transformer Encoder layers
    self.transformer_encoder = nn.TransformerEncoder(encoder_layer=nn.TransformerEncoderLayer(d_model=embedding_dim,
                                                                                              nhead=num_heads,
                                                                                              dim_feedforward=mlp_size,
                                                                                              activation="gelu",
                                                                                              batch_first=True,
                                                                                              norm_first=True),
                                                    num_layers=num_transformer_layer)

    # 7. Create MLP head
    self.mlp_head = nn.Sequential(
        nn.LayerNorm(normalized_shape=embedding_dim),
        nn.Linear(in_features=embedding_dim,
                  out_features=num_classes)
    )

  def forward(self, x):
    batch_size = x.shape[0]

    x = self.patch_embedding(x)

    class_token = self.class_token.expand(batch_size, -1, -1) # "-1" means to infer the dimension

    x = torch.cat((class_token,x),dim=1)

    # Add the positional embedding to the patch embedding with class token
    x = self.positional_embedding + x

    x = self.embedding_dropout(x)

    x = self.transformer_encoder(x)

    return self.mlp_head(x[:,0])

### 3. Train a pretrained ViT feature extractor model (like the one we made in 08. PyTorch Paper Replicating section 10) on 20% of the pizza, steak and sushi data like the dataset we used in 07. PyTorch Experiment Tracking section 7.3.

In [None]:
# Create ViT extractor model
weights = torchvision.models.ViT_B_16_Weights.DEFAULT
model = torchvision.models.vit_b_16(weights=weights).to(device)

for param in model.parameters():
  param.requires_grad = False

embedding_dim = 768
model.heads = nn.Sequential(
    nn.LayerNorm(normalized_shape=embedding_dim),
    nn.Linear(in_features=embedding_dim,
              out_features=len(class_names))
).to(device) # move the whole head (LayerNorm + Linear) to the target device

In [None]:
# Get 20% of the data
import requests
from pathlib import Path
from zipfile import ZipFile
import os

# Setup path to data folder
data_path = Path("data/")
image_path = data_path / "pizza_steak_sushi_20_percent"

if image_path.is_dir():
  print(f"{image_path} directory exists.")
else:
  print(f"Did not find {image_path} directory, creating one...")
  image_path.mkdir(parents=True, exist_ok=True)

  # Only download the zip file when the data isn't already on disk
  with open(data_path / "pizza_steak_sushi_20_percent.zip", "wb") as f:
    request = requests.get("https://github.com/mrdbourke/pytorch-deep-learning/raw/main/data/pizza_steak_sushi_20_percent.zip")
    f.write(request.content)

  with ZipFile(data_path / "pizza_steak_sushi_20_percent.zip", "r") as zip_ref:
    print("Unzipping pizza_steak_sushi_20_percent.zip...")
    zip_ref.extractall(image_path)

  # Remove the zip file once it has been closed
  os.remove(data_path / "pizza_steak_sushi_20_percent.zip")

# Preprocess the data
train_dataloader_20_percent, test_dataloader_20_percent, class_names = data_setup.create_dataloaders(train_dir=image_path / "train",
                                                                              test_dir=test_dir,
                                                                              transform=weights.transforms(),
                                                                              batch_size=1024)
# Train model
optimizer = torch.optim.Adam(params=model.parameters(),
                             lr=0.001)
loss_fn = nn.CrossEntropyLoss()
results = engine.train(model=model,
                       train_dataloader=train_dataloader_20_percent, # use the 20% dataloaders created above
                       test_dataloader=test_dataloader_20_percent,
                       optimizer=optimizer,
                       loss_fn=loss_fn,
                       epochs=10,
                       device=device)

In [None]:
# Examine results
from helper_functions import plot_loss_curves
plot_loss_curves(results)

### 4. Try repeating the steps from exercise 3 but this time use the "ViT_B_16_Weights.IMAGENET1K_SWAG_E2E_V1" pretrained weights from `torchvision.models.vit_b_16()`.

In [None]:
weights_swag=torchvision.models.ViT_B_16_Weights.IMAGENET1K_SWAG_E2E_V1
model_swag = torchvision.models.vit_b_16(weights=weights_swag).to(device)

for param in model_swag.parameters():
  param.requires_grad = False

embedding_dim = 768
model_swag.heads = nn.Sequential(
    nn.LayerNorm(normalized_shape=embedding_dim),
    nn.Linear(in_features=embedding_dim,
              out_features=len(class_names))
).to(device) # move the whole head (LayerNorm + Linear) to the target device

optimizer_swag = torch.optim.Adam(params=model_swag.parameters(),
                             lr=0.001)

train_dataloader_20_percent, test_dataloader_20_percent, class_names = data_setup.create_dataloaders(train_dir=image_path / "train",
                                                                              test_dir=test_dir,
                                                                              transform=weights_swag.transforms(),
                                                                              batch_size=32)
loss_fn = nn.CrossEntropyLoss()
results = engine.train(model=model_swag,
                       train_dataloader=train_dataloader_20_percent,
                       test_dataloader=test_dataloader_20_percent,
                       optimizer=optimizer_swag,
                       loss_fn=loss_fn,
                       epochs=10,
                       device=device)

In [None]:
plot_loss_curves(results)

In [None]:
test_dir = image_path / "test"

### Bonus: Get the "most wrong" examples from the test dataset

In [None]:
from tqdm import tqdm
from pathlib import Path
from PIL import Image

test_data_paths = list(Path(test_dir).glob("*/*.jpg"))
test_labels = [path.parent.stem for path in test_data_paths]

# Create a function to return a list of dictionaries with sample, label, prediction, pred prob
def pred_and_store(test_paths, model, transform, class_names, device):
  test_pred_list = []
  model.eval() # put the model in evaluation mode once, before the loop
  for path in tqdm(test_paths):
    # Create empty dict to store info for each sample
    pred_dict = {}

    # Get sample path
    pred_dict["image_path"] = path

    # Get class name (the name of the folder the image is in)
    class_name = path.parent.stem
    pred_dict["class_name"] = class_name

    # Get prediction and prediction probability
    img = Image.open(path) # open image
    transformed_image = transform(img).unsqueeze(0) # transform image and add batch dimension
    with torch.inference_mode():
      pred_logit = model(transformed_image.to(device))
      pred_prob = torch.softmax(pred_logit, dim=1)
      pred_label = torch.argmax(pred_prob, dim=1)
      pred_class = class_names[pred_label.item()] # .item() gives a plain Python int index

    # Store plain Python values so nothing in the dictionary stays on the GPU
    pred_dict["pred_prob"] = pred_prob.max().item()
    pred_dict["pred_class"] = pred_class

    # Does the pred match the true label?
    pred_dict["correct"] = class_name == pred_class

    # Add the dictionary to the list of preds
    test_pred_list.append(pred_dict)

  return test_pred_list

test_pred_dicts = pred_and_store(test_paths=test_data_paths,
                                 model=model_swag,
                                 transform=weights_swag.transforms(),
                                 class_names=class_names,
                                 device=device)

test_pred_dicts[:5]


In [None]:
# Turn the test_pred_dicts into a DataFrame
import pandas as pd
test_pred_df = pd.DataFrame(test_pred_dicts)
# Sort DataFrame by correct then by pred_prob
top_5_most_wrong = test_pred_df.sort_values(by=["correct", "pred_prob"], ascending=[True, False]).head()
top_5_most_wrong
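
To see why sorting by `correct` ascending and `pred_prob` descending surfaces the "most wrong" predictions, here is a small sketch on a made-up DataFrame with the same columns as `test_pred_df` (the file names and probabilities are invented for illustration).

```python
import pandas as pd

# Hypothetical prediction records in the same format as test_pred_dicts
toy_preds = [
    {"image_path": "a.jpg", "class_name": "pizza", "pred_prob": 0.95, "pred_class": "pizza", "correct": True},
    {"image_path": "b.jpg", "class_name": "steak", "pred_prob": 0.90, "pred_class": "sushi", "correct": False},
    {"image_path": "c.jpg", "class_name": "sushi", "pred_prob": 0.55, "pred_class": "pizza", "correct": False},
    {"image_path": "d.jpg", "class_name": "steak", "pred_prob": 0.60, "pred_class": "steak", "correct": True},
]
toy_df = pd.DataFrame(toy_preds)

# correct=False sorts before correct=True; within each group, highest probability comes first
toy_most_wrong = toy_df.sort_values(by=["correct", "pred_prob"], ascending=[True, False])
print(toy_most_wrong["image_path"].tolist())  # ['b.jpg', 'c.jpg', 'a.jpg', 'd.jpg']
```

The wrong prediction the model was most confident about (`b.jpg`) ends up at the top, which is usually the most useful sample to inspect first.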

In [None]:
test_pred_df.correct.value_counts()

In [None]:
import torchvision
import matplotlib.pyplot as plt
# Plot the top 5 most wrong images
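for _, row in top_5_most_wrong.iterrows():
  # Look up values by column name (positional indexing on a labelled row is deprecated in pandas)
  img_path = row["image_path"] # named img_path so we don't overwrite the `image_path` dataset variable
  true_label = row["class_name"]
  pred_prob = row["pred_prob"]
  pred_class = row["pred_class"]
  # Plot the image and various details
  img = torchvision.io.read_image(str(img_path)) # get image as tensor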
  plt.figure()
  plt.imshow(img.permute(1, 2, 0)) # matplotlib likes images in [height, width, color_channels]
  plt.title(f"True: {true_label} | Pred: {pred_class} | Prob: {pred_prob:.3f}")
  plt.axis(False);


### 5. Our custom ViT model architecture closely mimics that of the ViT paper, however, our training recipe misses a few things. Research some of the following topics from Table 3 in the ViT paper that we miss and write a sentence about each and how it might help with training:

1) ImageNet-21k pretraining (more data): pretraining on a much larger dataset helps the model learn more general image features, making it better at handling different kinds of images.

2) Learning rate warmup: a gradual increase in the learning rate at the start of training, which stabilizes the model in the early stage and makes it possible to reach larger learning rates.

3) Learning rate decay: a gradual decrease in the learning rate in the later stages of training, as the model gets closer to convergence.

4) Gradient clipping: prevents exploding gradients by limiting their size during optimization.
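
For point 4, a minimal sketch of gradient clipping in a single PyTorch training step is below. The toy model, the deliberately large inputs and `max_norm=1.0` are assumptions for illustration; in a real training loop the clipping call would go between `loss.backward()` and `optimizer.step()`.

```python
import torch
from torch import nn

torch.manual_seed(42)
clip_model = nn.Linear(10, 3)  # tiny stand-in model
clip_optimizer = torch.optim.Adam(clip_model.parameters(), lr=0.001)
clip_loss_fn = nn.CrossEntropyLoss()

X = torch.randn(8, 10) * 100    # deliberately large inputs to produce large gradients
y = torch.randint(0, 3, (8,))

loss = clip_loss_fn(clip_model(X), y)
clip_optimizer.zero_grad()
loss.backward()

# Rescale all gradients so their combined L2 norm is at most max_norm
# (clip_grad_norm_ returns the total norm measured before clipping)
norm_before = torch.nn.utils.clip_grad_norm_(clip_model.parameters(), max_norm=1.0)
norm_after = torch.sqrt(sum((p.grad ** 2).sum() for p in clip_model.parameters()))

clip_optimizer.step()
print(f"Gradient norm before: {norm_before:.2f} | after: {norm_after:.2f}")
```

Clipping by the global norm keeps the direction of the update the same and only shrinks its length, which is why it tends to stabilize training without changing what the model is trying to learn.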