# Vision Transformers (ViT, DeiT, Swin, DETR, SegFormer)

This notebook covers transformer-based architectures for vision: patch embeddings and ViT, data-efficient training with DeiT, the hierarchical Swin Transformer, detection with DETR, and segmentation with SegFormer.

## Learning objectives
- Understand image-to-token conversion via patch embeddings
- Learn ViT basics and hierarchical extensions (Swin)
- See how attention powers detection/segmentation (DETR/SegFormer)
- Use einops `rearrange` to reshape tensors cleanly

## References (Papers)
- Dosovitskiy et al., 2020. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (arXiv:2010.11929)
- Touvron et al., 2020. "Training data-efficient image transformers & distillation through attention" (DeiT, arXiv:2012.12877)
- Liu et al., 2021. "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" (arXiv:2103.14030)
- Carion et al., 2020. "End-to-End Object Detection with Transformers" (DETR, arXiv:2005.12872)
- Xie et al., 2021. "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers" (arXiv:2105.15203)


In [None]:
# Patch embedding + einops rearrange demo
import torch
from einops import rearrange
from patch_embedding import PatchEmbedding, restore_batch_from_flat_embeddings

B, C, H, W = 2, 3, 224, 224
x = torch.randn(B, C, H, W)
pe = PatchEmbedding(img_size=224, patch_size=16, in_channels=3, embed_dim=64)
emb = pe(x)  # (B, N, C), where C is now embed_dim (64), not the 3 input channels
print('Patch embeddings:', emb.shape)

# Flatten (B, N, C) -> (B*N, C) and restore with rearrange
B_, N, C_ = emb.shape
flat = rearrange(emb, 'B N C -> (B N) C')
restored = restore_batch_from_flat_embeddings(flat, B_, N)
print('Flat:', flat.shape, 'Restored:', restored.shape)
