# **Seminar 4. Visual attention**

**Lesson plan:**

Adapting the transformer to image classification. Implementing a simple ViT.


You should already understand the basic model training process; see the PyTorch Lightning core skills guide:
https://pytorch-lightning.readthedocs.io/en/latest/levels/core_skills.html

## Vision Transformer (ViT)

[An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/pdf/2010.11929.pdf)

### Positional encoding

**Question:** How can we give the model information about the order of the sequence?

**Criteria for a positional encoding:**

1) A unique encoding for each word position

2) The step (delta) between positions must not differ between sequences of different lengths

3) Generalization to longer sentences -> bounded values

4) Determinism


Source: [Attention Is All You Need, Section 3.5](https://arxiv.org/pdf/1706.03762.pdf)

$$PE(x, 2i) = \sin\left(\frac{x}{10000^{2i/D}}\right)$$

$$PE(x, 2i+1) = \cos\left(\frac{x}{10000^{2i/D}}\right)$$

Where:

- $x$ is a point in 1D space (the position)
- $i$ is an integer in $[0, D/2)$, where $D$ is the size of the channel dimension

![alt text](https://drive.google.com/uc?export=view&id=1Xdq4ap-eSHjgRnz08KK4UWOmrXSOdOwY)

![alt text](https://drive.google.com/uc?export=view&id=1-DrPfHnk1fln_sGN6THRdEDDtSOD9dy1)

![alt text](https://drive.google.com/uc?export=view&id=1KKdJVRnSswPi9xOGRZK8IvTBsK5Xw_aN)

"We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$."
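The linearity claim in the quote can be checked numerically. The sketch below uses only the standard library; the helper names `pe_pair` and `shift` are hypothetical and introduced here for illustration. For one frequency index `i`, the pair (sin, cos) at position `pos + k` is obtained from the pair at `pos` by a rotation whose coefficients depend only on `k`, not on `pos`.

```python
import math

def pe_pair(pos, i, d_model):
    """(sin, cos) components of the encoding for frequency index i."""
    omega = 1.0 / (10000 ** (2 * i / d_model))
    return math.sin(omega * pos), math.cos(omega * pos)

def shift(pair, k, i, d_model):
    """Rotate a (sin, cos) pair by the angle omega * k; pos is never used here."""
    omega = 1.0 / (10000 ** (2 * i / d_model))
    s, c = pair
    # sin(a + b) = sin(a)cos(b) + cos(a)sin(b)
    # cos(a + b) = cos(a)cos(b) - sin(a)sin(b)
    return (s * math.cos(omega * k) + c * math.sin(omega * k),
            c * math.cos(omega * k) - s * math.sin(omega * k))

pos, k, i, d_model = 7, 5, 3, 64  # arbitrary illustrative values
shifted = shift(pe_pair(pos, i, d_model), k, i, d_model)
direct = pe_pair(pos + k, i, d_model)
max_err = max(abs(a - b) for a, b in zip(shifted, direct))
print(max_err)  # difference is at floating-point rounding level
```

Because `shift` is a linear map with coefficients fixed by `k`, a linear layer could in principle learn it, which is the intuition behind the quote.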

[proof-Relative Positioning](https://kazemnejad.com/blog/transformer_architecture_positional_encoding/)

[more examples](https://machinelearningmastery.com/a-gentle-introduction-to-positional-encoding-in-transformer-models-part-1/)

A distinction is made between absolute positional encoding (APE) and relative positional encoding (RPE): [paper](https://paperswithcode.com/method/relative-position-encodings)

[Code](https://github.com/gazelle93/Transformer-Various-Positional-Encoding)

**Positional encoding in ViT**

**Task:** implement `positional_encoding_1d`.

In [None]:
# See also:
# https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial6/Transformers_and_MHAttention.html

In [None]:
# Given:
# positional_encoding: [1, seq_length, num_dim_to_encode]
# _2i: [num_dim_to_encode//2]
# position: [seq_length, 1]
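One possible solution to the task, as a minimal sketch. The function signature and the `base` parameter are assumptions (the notebook does not fix them); the variable names and shapes follow the "Given" comments above, and the formula is the sinusoidal encoding from Section 3.5 of "Attention Is All You Need".

```python
import torch

def positional_encoding_1d(seq_length, num_dim_to_encode, base=10000.0):
    """Sinusoidal encoding of shape [1, seq_length, num_dim_to_encode]."""
    # Assumption of this sketch: an even number of channels.
    assert num_dim_to_encode % 2 == 0
    # position: [seq_length, 1]
    position = torch.arange(seq_length, dtype=torch.float32).unsqueeze(1)
    # _2i: [num_dim_to_encode // 2], the even channel indices 0, 2, 4, ...
    _2i = torch.arange(0, num_dim_to_encode, 2, dtype=torch.float32)
    # Broadcasting gives angles of shape [seq_length, num_dim_to_encode // 2]
    angles = position / base ** (_2i / num_dim_to_encode)
    positional_encoding = torch.zeros(1, seq_length, num_dim_to_encode)
    positional_encoding[0, :, 0::2] = torch.sin(angles)  # even channels
    positional_encoding[0, :, 1::2] = torch.cos(angles)  # odd channels
    return positional_encoding

pe = positional_encoding_1d(seq_length=197, num_dim_to_encode=768)
print(pe.shape)
```

All values stay in [-1, 1] and the function is deterministic, which matches criteria 3 and 4 from the list of encoding criteria.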

### ViT

![alt text](https://drive.google.com/uc?export=view&id=1J5TvycDPs8pzfvlXvtO5MCFBy64yp9Fa)

In [1]:
import torch
import torch.nn as nn
from einops.layers.torch import Rearrange

In [None]:
class MLP(nn.Module):
    def __init__(self, in_features, hidden_features=None, out_features=None,
                 dropout=0.):
        super().__init__()
        ...

    def forward(self, x):
        ...
        return x
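One way the `MLP` skeleton could be filled in, shown as a hedged sketch rather than the intended solution. The class name `MLPSketch`, the GELU activation, and the placement of dropout after each linear layer are assumptions that follow common ViT implementations.

```python
import torch
import torch.nn as nn

class MLPSketch(nn.Module):
    """Two-layer feed-forward network applied to each token independently."""

    def __init__(self, in_features, hidden_features=None, out_features=None,
                 dropout=0.):
        super().__init__()
        # Fall back to the input width when the optional sizes are not given.
        hidden_features = hidden_features or in_features
        out_features = out_features or in_features
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden_features),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_features, out_features),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

x = torch.randn(2, 10, 64)                   # [batch, tokens, channels]
y = MLPSketch(64, hidden_features=256)(x)    # hidden width = 4 * channels
print(y.shape)
```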

In [None]:
class Attention(nn.Module):
    def __init__(self, dim, num_heads=8, attn_dropout=0., proj_dropout=0.):
        super().__init__()
        self.num_heads = num_heads
        # Scale by the per-head dimension (assumes dim is split evenly across heads)
        self.scale = (dim // num_heads) ** -0.5

        self.qkv = ...
        self.attn_dropout = nn.Dropout(attn_dropout)
        self.out = ...

    def forward(self, x):
        ...
        ...
        return x
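A possible completion of the attention skeleton, as a sketch under stated assumptions: `dim` is split evenly across heads, a single linear layer produces queries, keys, and values, and the scale uses the per-head dimension. The class name `AttentionSketch` and the attribute `head_dim` are hypothetical.

```python
import torch
import torch.nn as nn

class AttentionSketch(nn.Module):
    """Multi-head self-attention over a [batch, tokens, dim] input."""

    def __init__(self, dim, num_heads=8, attn_dropout=0., proj_dropout=0.):
        super().__init__()
        assert dim % num_heads == 0, "dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)   # q, k, v in one projection
        self.attn_dropout = nn.Dropout(attn_dropout)
        self.out = nn.Sequential(nn.Linear(dim, dim), nn.Dropout(proj_dropout))

    def forward(self, x):
        b, n, c = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)            # each: [b, heads, n, head_dim]
        attn = (q @ k.transpose(-2, -1)) * self.scale   # [b, heads, n, n]
        attn = self.attn_dropout(attn.softmax(dim=-1))
        x = (attn @ v).transpose(1, 2).reshape(b, n, c) # merge heads back
        return self.out(x)

x = torch.randn(2, 10, 64)
y = AttentionSketch(dim=64, num_heads=8).eval()(x)
print(y.shape)
```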

In [None]:
class ImgPatches(nn.Module):
    def __init__(self, in_ch=3, embed_dim=768, patch_size=16):
        super().__init__()
        self.patch_embed = nn.Sequential(
            nn.Conv2d(in_channels=in_ch, out_channels=embed_dim, kernel_size=patch_size, stride=patch_size),
            # Conv output is [b, c, h, w]; flatten the grid into a sequence of patch tokens
            Rearrange('b c h w -> b (h w) c')
        )

    def forward(self, img):
        patches = self.patch_embed(img)
        return patches
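To make the tensor shapes of the patch embedding concrete, here is a small sketch that uses plain `flatten` and `transpose` instead of einops. The class name `PatchEmbedSketch` is hypothetical. A convolution with `kernel_size == stride == patch_size` maps every non-overlapping 16x16 patch to one `embed_dim` vector, so a 224x224 image gives (224 / 16)^2 = 196 tokens.

```python
import torch
import torch.nn as nn

class PatchEmbedSketch(nn.Module):
    """Split an image into non-overlapping patches and embed each one."""

    def __init__(self, in_ch=3, embed_dim=768, patch_size=16):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, img):
        x = self.proj(img)                   # [b, embed_dim, H/ps, W/ps]
        return x.flatten(2).transpose(1, 2)  # [b, num_patches, embed_dim]

img = torch.randn(2, 3, 224, 224)
patches = PatchEmbedSketch()(img)
print(patches.shape)  # 14 * 14 = 196 patch tokens per image
```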

In [None]:
class Block(nn.Module):
    def __init__(self, dim, num_heads=8, mlp_ratio=4, drop_rate=0.):
        super().__init__()
        ...

    def forward(self, x):
        ...
        return x
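A hedged sketch of what a pre-norm encoder block could look like: LayerNorm, self-attention, residual connection, then LayerNorm, MLP, residual connection. To keep the example self-contained, it swaps in PyTorch's built-in `nn.MultiheadAttention` and an inline MLP in place of the `Attention` and `MLP` classes from this notebook; the class name `BlockSketch` is hypothetical.

```python
import torch
import torch.nn as nn

class BlockSketch(nn.Module):
    """Pre-norm transformer encoder block with two residual branches."""

    def __init__(self, dim, num_heads=8, mlp_ratio=4, drop_rate=0.):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Stand-in for the notebook's Attention class.
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=drop_rate,
                                          batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Stand-in for the notebook's MLP class.
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Dropout(drop_rate),
            nn.Linear(dim * mlp_ratio, dim), nn.Dropout(drop_rate),
        )

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out                  # residual around attention
        x = x + self.mlp(self.norm2(x))   # residual around MLP
        return x

x = torch.randn(2, 10, 64)
y = BlockSketch(dim=64, num_heads=8).eval()(x)
print(y.shape)
```

The block preserves the input shape, which is what allows `depth` copies of it to be stacked in a loop.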

In [None]:
class Transformer(nn.Module):
    def __init__(self, depth, dim, num_heads=8, mlp_ratio=4, drop_rate=0.):
        super().__init__()
        self.blocks = nn.ModuleList([
            Block(dim, num_heads, mlp_ratio, drop_rate)
            for i in range(depth)])

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x

In [None]:
class ViT(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_ch=3, num_classes=1000,
                 embed_dim=768, depth=12, num_heads=12, mlp_ratio=4,
                 drop_rate=0.3):
        super().__init__()

        ...

    def forward(self, x):
        return x
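For orientation, a compact sketch of how the pieces might fit together: patch embedding, a prepended class token, learnable position embeddings (as in the ViT paper linked above), an encoder stack, and a linear classification head on the class token. This is not the intended solution: the name `ViTSketch` is hypothetical, and PyTorch's built-in `nn.TransformerEncoder` stands in for the notebook's `Transformer` class.

```python
import torch
import torch.nn as nn

class ViTSketch(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_ch=3, num_classes=1000,
                 embed_dim=768, depth=12, num_heads=12, mlp_ratio=4,
                 drop_rate=0.3):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        self.patch_embed = nn.Conv2d(in_ch, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # One learnable position vector per patch, plus one for the class token.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)
        self.pos_drop = nn.Dropout(drop_rate)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=embed_dim * mlp_ratio, dropout=drop_rate,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        x = self.patch_embed(x).flatten(2).transpose(1, 2)  # [b, patches, dim]
        cls = self.cls_token.expand(x.shape[0], -1, -1)     # [b, 1, dim]
        x = torch.cat([cls, x], dim=1)
        x = self.pos_drop(x + self.pos_embed)
        x = self.encoder(x)
        return self.head(self.norm(x[:, 0]))                # classify from class token

# A deliberately tiny configuration so the example runs quickly.
model = ViTSketch(img_size=32, patch_size=8, num_classes=10,
                  embed_dim=64, depth=2, num_heads=4).eval()
logits = model(torch.randn(2, 3, 32, 32))
print(logits.shape)
```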

## Training

In [None]:
# conda create --name lec5 python=3.9
# conda activate lec5
# pip install --quiet "setuptools==59.5.0" "pytorch-lightning>=1.4" "matplotlib" "torch>=1.8" "ipython[notebook]" "torchmetrics>=0.7" "torchvision" "seaborn"