# Original CvT-Model

<img src="./../CvT-Original.drawio.png?raw=1" alt="CvT model with convolutional embedding" title="CvT model with convolutional embedding" height="400" />

All dimensions are given without the batch size.

## Input-Dimensions

**Dimensions:** $H_0 = 64px, \quad W_0 = 64px, \quad C_0 = 3$ \
**Output-Shape:** `(3, 64, 64)`

## Conv2d

Computing the output dimensions:

$ \text{kernel size}\ k = 7, \quad \text{stride}\ s = 4, \quad \text{padding}\ p = 3 $ \
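$ H_i = \left\lfloor \frac{H_{i-1} + 2p - k}{s} \right\rfloor + 1, \quad W_i = \left\lfloor \frac{W_{i-1} + 2p - k}{s} \right\rfloor + 1 $ \
For $H_0 = 64$: $\left\lfloor \frac{64 + 6 - 7}{4} \right\rfloor + 1 = 15 + 1 = 16$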

**Output-Dimensions:** $H_1 = 16px, \quad W_1 = 16px, \quad C_1 = 64$ \
**Output-Shape:** `(64, 16, 16)`
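
As a quick check of the output-size formula, the following sketch evaluates it in plain Python. The helper name `conv2d_output_size` is chosen here for illustration and is not part of the notebook; integer division rounds $63 / 4 = 15.75$ down to 15.

```python
def conv2d_output_size(size: int, kernel_size: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution without dilation (floor division)."""
    return (size + 2 * padding - kernel_size) // stride + 1


# Values from this section: 64x64 input, k = 7, s = 4, p = 3
h1 = conv2d_output_size(64, kernel_size=7, stride=4, padding=3)
w1 = conv2d_output_size(64, kernel_size=7, stride=4, padding=3)
print((64, h1, w1))  # (64, 16, 16)
```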

## Flatten

**Output-Dimensions:** $H_1 W_1 \times C_1 = (16 \cdot 16) \times 64$ \
**Output-Shape:** `(256, 64)`
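
The flatten step can be sketched with plain PyTorch operations. The notebook itself uses `einops.rearrange` with a batch dimension; the version below omits the batch dimension to match the shapes in this section, and the variable names are illustrative.

```python
import torch

feature_map = torch.randn(64, 16, 16)  # (C_1, H_1, W_1), no batch dimension

# Merge the two spatial axes, then put the token axis first:
# (C_1, H_1, W_1) -> (C_1, H_1*W_1) -> (H_1*W_1, C_1)
tokens = feature_map.flatten(1).transpose(0, 1)
print(tokens.shape)  # torch.Size([256, 64])

# Token i holds the channel vector of pixel (i // W_1, i % W_1).
assert torch.equal(tokens[17], feature_map[:, 1, 1])
```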

## Multi-Head Attention

Computing the query, key, and value matrices:

$X \in \mathbb{R}^{H_1 W_1 \times C_1}$ \
$d_k$ is the dimension of the query, key, and value vectors \
$W^Q, W^K, W^V \in \mathbb{R}^{C_1 \times d_k}$ \
$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$

$d_k = 64$ \
$Q, K, V \in \mathbb{R}^{256 \times 64}$

**Output-Dimensions:** $256 \times 64$ \
**Output-Shape:** `(256, 64)`
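
To make the shapes concrete, here is a minimal single-head sketch of the projections followed by standard scaled dot-product attention. The weight matrices are random stand-ins for learned parameters, and the sketch leaves out the output projection and dropout that `nn.MultiheadAttention` applies, so it illustrates the dimensions rather than reproducing that layer exactly.

```python
import math
import torch

torch.manual_seed(0)
n_tokens, c1, d_k = 256, 64, 64

X = torch.randn(n_tokens, c1)
# Random stand-ins for the learned projection matrices W^Q, W^K, W^V
W_q = torch.randn(c1, d_k) / math.sqrt(c1)
W_k = torch.randn(c1, d_k) / math.sqrt(c1)
W_v = torch.randn(c1, d_k) / math.sqrt(c1)

Q, K, V = X @ W_q, X @ W_k, X @ W_v       # each (256, 64)
scores = Q @ K.T / math.sqrt(d_k)         # (256, 256) token-to-token similarities
weights = torch.softmax(scores, dim=-1)   # each row sums to 1
attn_out = weights @ V                    # (256, 64)

print(Q.shape, weights.shape, attn_out.shape)
```

The attention matrix has one row and one column per token, so its size grows quadratically with $H_1 W_1$, while the output keeps the input shape `(256, 64)`.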

## MLP

Expansion factor: $e = 4$

1. **Step:** Linear ➔ GELU ➔ Dropout
   
   **Output-Dimensions:** $256 \times (64 \cdot 4) = 256 \times 256$ \
   **Output-Shape:** `(256, 256)`

2. **Step:** Linear ➔ Dropout

    **Output-Dimensions:** $256 \times \frac{256}{4} = 256 \times 64$ \
    **Output-Shape:** `(256, 64)`
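
The two MLP steps can be traced with a short sketch. Dropout is left out because it does not change shapes, and the layer names `fc1` and `fc2` are illustrative rather than taken from the notebook.

```python
import torch
import torch.nn as nn

embed_dim, expansion = 64, 4
hidden_dim = embed_dim * expansion        # 256

fc1 = nn.Linear(embed_dim, hidden_dim)    # step 1: expand 64 -> 256
fc2 = nn.Linear(hidden_dim, embed_dim)    # step 2: project 256 -> 64

tokens = torch.randn(256, embed_dim)      # (H_1*W_1, C_1)
hidden = nn.functional.gelu(fc1(tokens))  # (256, 256)
mlp_out = fc2(hidden)                     # (256, 64)
print(hidden.shape, mlp_out.shape)

# Weights plus biases of both layers: (64*256 + 256) + (256*64 + 64)
n_params = sum(p.numel() for p in (*fc1.parameters(), *fc2.parameters()))
print(n_params)  # 33088
```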


# Imports

In [1]:
%pip install pytorch-lightning
%pip install lightning
%pip install einops

Note: you may need to restart the kernel to use updated packages.


In [2]:
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
import lightning as L
from einops import rearrange
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger
import torch.nn.functional as F
from torchmetrics.classification import Accuracy
from pathlib import Path
import wandb

# Model

In [3]:
class EmbeddingBlock(nn.Module):
    def __init__(self,
                 in_channels=3,
                 embed_dim=64,
                 patch_size=7,
                 stride=4,
                 padding=3,
                 norm_layer=nn.LayerNorm):
        super().__init__()
        self.proj = nn.Conv2d(
            in_channels, embed_dim,
            kernel_size=patch_size,
            stride=stride,
            padding=padding
        )
        self.norm = norm_layer(embed_dim)

    def forward(self, x):
        x = self.proj(x)
        x = rearrange(x, 'b c h w -> b (h w) c')
        x = self.norm(x)
        return x


class ConvProjection(nn.Module):
    def __init__(self, embed_dim, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.proj = nn.Conv2d(embed_dim, embed_dim, kernel_size=kernel_size,
                              stride=stride, padding=padding, groups=embed_dim)
        self.pointwise = nn.Conv2d(embed_dim, embed_dim, kernel_size=1)
        self.norm = nn.BatchNorm2d(embed_dim)

    def forward(self, x):
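        B, N, C = x.shape
        # Tokens are assumed to come from a square feature map (N = H * W with H == W)
        H = W = int(N**0.5)
        assert H * W == N, f"Expected a square number of tokens, but got {N}"
        x = x.transpose(1, 2).reshape(B, C, H, W)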
        x = self.proj(x)
        x = self.pointwise(x)
        x = self.norm(x)
        x = x.flatten(2).transpose(1, 2)
        return x


class TransformerBlock(nn.Module):
    def __init__(self,
                 embed_dim=64,
                 num_heads=1,
                 mlp_ratio=4.0,
                 drop=0.1,
                 norm_layer=nn.LayerNorm):
        super().__init__()
        self.conv_proj = ConvProjection(embed_dim)
        self.norm1 = norm_layer(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, dropout=drop, batch_first=True)

        self.norm2 = norm_layer(embed_dim)
        hidden_dim = int(embed_dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(drop),
            nn.Linear(hidden_dim, embed_dim),
            nn.Dropout(drop)
        )

    def forward(self, x):
        x_proj = self.conv_proj(x)
        # Normalize once and reuse the result as query, key, and value (self-attention)
        normed = self.norm1(x_proj)
        x = x + self.attn(normed, normed, normed)[0]
        x = x + self.mlp(self.norm2(x))
        return x
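
`ConvProjection` combines a depthwise 3x3 convolution (`groups=embed_dim`) with a pointwise 1x1 convolution. The sketch below compares the parameter count of that pair with a single full 3x3 convolution, based on PyTorch's `Conv2d` weight shape `(out_channels, in_channels / groups, k, k)`. The helper `conv2d_param_count` is hypothetical and not part of the notebook.

```python
def conv2d_param_count(in_channels: int, out_channels: int, kernel_size: int,
                       groups: int = 1, bias: bool = True) -> int:
    """Number of learnable parameters of a 2D convolution with a square kernel."""
    weights = out_channels * (in_channels // groups) * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)


embed_dim = 64
depthwise = conv2d_param_count(embed_dim, embed_dim, 3, groups=embed_dim)  # 3x3, one filter per channel
pointwise = conv2d_param_count(embed_dim, embed_dim, 1)                    # 1x1, mixes channels
standard = conv2d_param_count(embed_dim, embed_dim, 3)                     # full 3x3 convolution

print(depthwise + pointwise, standard)  # 4800 36928
```

With `embed_dim=64`, the separable pair needs roughly an eighth of the parameters of the full convolution.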

## Testing

In [4]:
dummy_input = torch.randn(2, 3, 64, 64)
expected_output_shape = (2, 256, 64)

embedding_block = EmbeddingBlock()
transformer_block = TransformerBlock()

output = embedding_block(dummy_input)
output = transformer_block(output)

assert output.shape == expected_output_shape, f"Expected shape {expected_output_shape}, but got {output.shape}"
print("Output shape is as expected:", output.shape)

Output shape is as expected: torch.Size([2, 256, 64])


# Dataset

In [5]:
from models.processData import prepare_data_and_get_loaders

train_loader, val_loader, test_loader = prepare_data_and_get_loaders("/datasets/tiny-imagenet-200/tiny-imagenet-200.zip", "/datasets/tiny-imagenet-200/")

Data already extracted.
Processing validation set...


OSError: [Errno 30] Read-only file system: '/datasets/tiny-imagenet-200/tiny-imagenet-200'
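
The `OSError` indicates that the data directory lies on a read-only file system. One possible workaround is to check whether the directory is writable and otherwise fall back to a scratch location. The helper `pick_writable_dir` and the fallback path are hypothetical, and whether `prepare_data_and_get_loaders` accepts a different target directory as its second argument is an assumption based on the call above.

```python
import os
import tempfile
from pathlib import Path


def pick_writable_dir(preferred: str, fallback: str) -> Path:
    """Return `preferred` if it exists and is writable, else create and return `fallback`."""
    preferred_path = Path(preferred)
    if preferred_path.is_dir() and os.access(preferred_path, os.W_OK):
        return preferred_path
    fallback_path = Path(fallback)
    fallback_path.mkdir(parents=True, exist_ok=True)
    return fallback_path


# Hypothetical fallback location; adjust to the environment.
extract_dir = pick_writable_dir(
    "/datasets/tiny-imagenet-200/",
    os.path.join(tempfile.gettempdir(), "tiny-imagenet-200"),
)
print(extract_dir)
```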

## Testing

In [None]:
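def imshow(img):
    # Undo the normalization, assuming it used mean 0.5 and std 0.5 per channel
    img = img / 2 + 0.5
    npimg = img.numpy()
    # Reorder from (C, H, W) to (H, W, C), the layout matplotlib expects
    plt.imshow(np.transpose(npimg, (1, 2, 0)))
    plt.axis('off')
    plt.show()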

image, label = train_loader.dataset[0]
imshow(image)

# Model

In [None]:
class CvTSimplifiedEmbedding(nn.Module):
    def __init__(self,
                 in_channels=3,
                 num_classes=200,
                 embed_dim=64,
                 depth=2):
        super().__init__()

        self.num_classes = num_classes

        self.embedding = EmbeddingBlock(
            in_channels=in_channels,
            embed_dim=embed_dim,
            stride=4,
            padding=3,
            norm_layer=nn.LayerNorm
        )

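        # Stack `depth` transformer blocks; each maps (B, N, embed_dim) -> (B, N, embed_dim).
        # EmbeddingBlock expects an image tensor (B, C, H, W), so it is applied only once, above.
        self.transformer_blocks = nn.Sequential(*[
            TransformerBlock(embed_dim=embed_dim) for _ in range(depth)
        ])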

        self.head = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(1),
            nn.Linear(embed_dim, num_classes)
        )


    def forward(self, x):
        x = self.embedding(x)
        x = self.transformer_blocks(x)
        x = x.permute(0, 2, 1)
        x = self.head(x)
        return x
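
The classification head permutes the tokens to `(B, C, N)` so that `nn.AdaptiveAvgPool1d(1)` averages over the token axis. The sketch below, using a random tensor in place of real transformer output, checks that this is equivalent to taking the mean over the tokens directly.

```python
import torch
import torch.nn as nn

tokens = torch.randn(8, 256, 64)  # (B, N, C), shaped like the transformer output

# Head-style pooling: (B, N, C) -> (B, C, N) -> (B, C, 1) -> (B, C)
pooled = nn.AdaptiveAvgPool1d(1)(tokens.permute(0, 2, 1)).flatten(1)

# Equivalent formulation: average over the token dimension directly
assert torch.allclose(pooled, tokens.mean(dim=1), atol=1e-6)
print(pooled.shape)  # torch.Size([8, 64])
```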

## Testing

In [None]:
model = CvTSimplifiedEmbedding()
dummy_input = torch.randn(8, 3, 64, 64)
output = model(dummy_input)

assert output.shape == (8, 200), f"Expected output shape (8, 200), but got {output.shape}"
print("Model output shape is as expected:", output.shape)

# Training

In [None]:
from models.trainModel import train_test_model

train_test_model(model, train_loader, val_loader, test_loader)