#  Transformer Variants in Deep Learning

---

##  Foundation & Core Transformer
- **Transformer (original)** – Vaswani et al. (2017)  
  *“Attention Is All You Need.”* NeurIPS 2017.  

---

##  Transformer Families

| **Category** | **Model** | **Authors & Year** | **Paper** | **Key Contributions** |
|--------------|-----------|--------------------|-----------|-----------------------|
| **Encoder-only (Masked LM)** | **BERT** | Devlin et al. (2018, Google AI) | *BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.* | Bidirectional masked LM pretraining. |
| | **RoBERTa** | Liu et al. (2019) | *RoBERTa: A Robustly Optimized BERT Pretraining Approach.* | More training, larger data, no NSP task. |
| | **ALBERT** | Lan et al. (2019) | *ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.* | Cross-layer weight sharing + Sentence Order Prediction. |
| | **ELECTRA** | Clark et al. (2020) | *ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators.* | Replaces MLM with Replaced Token Detection. |
| **Decoder-only (Autoregressive)** | **GPT** | Radford et al. (2018, OpenAI) | *Improving Language Understanding by Generative Pre-Training.* | Left-to-right autoregressive Transformer. |
| | **GPT-2** | Radford et al. (2019) | *Language Models are Unsupervised Multitask Learners.* | Large-scale autoregressive model, strong text generation. |
| | **GPT-3** | Brown et al. (2020) | *Language Models are Few-Shot Learners.* | 175B parameters, few-shot learning ability. |
| **Encoder–Decoder (“Text-to-Text”)** | **T5** | Raffel et al. (2020, Google) | *Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.* | Unified text-to-text framework. |
| | **Switch Transformer, ByT5, UL2, Flan-T5, Pile-T5** | Google & others (2021–2024) | (separate papers) | Scaling, byte-level processing, improved fine-tuning and transfer learning. |
| **Vision & Multimodal** | **ViT** | Dosovitskiy et al. (2021) | *An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale.* | First pure-Transformer vision model. |
| | **Swin Transformer** | Liu et al. (2021, Microsoft) | *Swin Transformer: Hierarchical Vision Transformer using Shifted Windows.* | Hierarchical shifted-window design; Swin Transformer V2 (2022) scales it to high-resolution vision tasks. |
| | **Perceiver / Perceiver IO** | Jaegle et al. (2021, DeepMind) | *Perceiver: General Perception with Iterative Attention.* | Multimodal, data-agnostic, bottleneck attention. |

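The ELECTRA row says Replaced Token Detection takes the place of masked LM. As a rough illustration of what the discriminator's training targets might look like, the sketch below compares an original token sequence with a corrupted copy and labels each position. The function name `replaced_token_labels` is hypothetical, and the generator network that would normally produce the corrupted copy is left out; here the corrupted sequence is simply written by hand.

```python
def replaced_token_labels(original, corrupted):
    """Return 1 for each position whose token differs from the original, else 0."""
    if len(original) != len(corrupted):
        raise ValueError("sequences must have the same length")
    # A sampled token that happens to equal the original counts as "not replaced".
    return [int(o != c) for o, c in zip(original, corrupted)]


original = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate", "the", "meal"]  # hand-written stand-in for generator output
labels = replaced_token_labels(original, corrupted)
```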
---

##  Summary Table

| **Category** | **Example Models** |
|--------------|---------------------|
| Encoder-only | BERT, RoBERTa, ALBERT, ELECTRA |
| Decoder-only (Autoregressive) | GPT, GPT-2, GPT-3 |
| Encoder–Decoder | T5 (+ Switch, ByT5, UL2, Flan-T5, Pile-T5) |
| Vision / Multimodal | ViT, Swin Transformer, Perceiver (and IO) |

---

##  Research Sources
- Lin et al. (2021). *A Survey of Transformers.* arXiv.  
- **Vision Transformers (ViT, Swin)** – arXiv, Wikipedia.  
- **Perceiver / Perceiver IO** – DeepMind papers, Wikipedia.  


#  Transformer-Based Models in Deep Learning (2017–2025)

---

## 🔹 Foundation

| **Model** | **Year** | **Authors / Org** | **Key Idea** |
|-----------|----------|-------------------|--------------|
| **Transformer** | 2017 | Vaswani et al., Google Brain | *Attention Is All You Need.* Introduced an encoder–decoder architecture built entirely on self-attention, eliminating recurrence and convolutions. |

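The self-attention mentioned in the table can be sketched in a few lines. This is a minimal, single-head illustration of scaled dot-product attention in plain Python, not the paper's full layer: the learned query/key/value projections, multiple heads, and positional encodings are all omitted, and the function names are chosen here for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """For each query, return the softmax(q.k / sqrt(d_k))-weighted sum of the values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        dim = len(values[0])
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)])
    return outputs


# Self-attention: queries, keys, and values all come from the same sequence.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = scaled_dot_product_attention(x, x, x)
```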
---

## 🔹 Encoder-Only Models (Bidirectional, Masked LM)

| **Model** | **Year** | **Authors / Org** | **Key Idea** |
|-----------|----------|-------------------|--------------|
| **BERT** | 2018 | Devlin et al., Google AI | Bidirectional masked LM, pre-training + fine-tuning paradigm. |
| **RoBERTa** | 2019 | Liu et al., Facebook AI | Optimized BERT training: more data, larger batches, no NSP task. |
| **ALBERT** | 2019 | Lan et al., Google Research & Toyota Technological Institute at Chicago | Lightweight BERT with cross-layer parameter sharing + SOP objective. |
| **ELECTRA** | 2020 | Clark et al., Google Research | Pre-training with Replaced Token Detection (discriminator-based). |
| **DeBERTa** | 2020 | He et al., Microsoft | Disentangled attention (separate content and relative-position vectors) + enhanced mask decoder. |

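To make the masked LM objective concrete, here is a small sketch of the corruption step described in the BERT paper: roughly 15% of tokens are selected as prediction targets, and a selected token becomes `[MASK]` 80% of the time, a random token 10% of the time, and stays unchanged 10% of the time. The function name `mask_tokens` is hypothetical, and selecting each token independently with probability 0.15 is a simplification assumed here, not necessarily how any particular implementation samples.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (inputs, labels); labels hold the original token at target positions, else None."""
    if rng is None:
        rng = random.Random()
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict the original token here
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(tok)                # 10%: keep the token unchanged
        else:
            inputs.append(tok)
            labels.append(None)  # not a prediction target
    return inputs, labels


tokens = "the quick brown fox jumps over the lazy dog".split()
inputs, labels = mask_tokens(tokens, vocab=tokens, rng=random.Random(0))
```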
---

## 🔹 Decoder-Only Models (Autoregressive, Generative)

| **Model** | **Year** | **Authors / Org** | **Key Idea** |
|-----------|----------|-------------------|--------------|
| **GPT** | 2018 | Radford et al., OpenAI | Generative pretraining with left-to-right autoregressive Transformer. |
| **GPT-2** | 2019 | Radford et al., OpenAI | Larger GPT (up to 1.5B parameters); strong zero-shot text generation without task-specific fine-tuning. |
| **GPT-3** | 2020 | Brown et al., OpenAI | *Language Models are Few-Shot Learners.* 175B parameters. |
| **GPT-4** | 2023 | OpenAI | Accepts image and text inputs and produces text; post-trained with RLHF. Architecture and size undisclosed. |

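What makes these models "left-to-right" is a causal mask on the attention scores: position *i* may attend only to positions *j ≤ i*. The sketch below applies that mask to a square matrix of raw scores and normalizes each row. It is a plain-Python illustration with a hypothetical function name; real implementations typically add a large negative value to masked scores before a batched softmax instead of slicing rows.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j], with future positions (j > i) given zero weight."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # positions 0..i are the only ones token i may see
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights


# With uniform scores, each token spreads its attention evenly over itself and its past.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = causal_attention_weights(scores)
```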
---

## 🔹 Encoder–Decoder Models (Text-to-Text Framework)

| **Model** | **Year** | **Authors / Org** | **Key Idea** |
|-----------|----------|-------------------|--------------|
| **T5** | 2020 | Raffel et al., Google | Unified text-to-text framework across tasks. |
| **Switch Transformer** | 2021 | Fedus et al., Google | Sparse Mixture-of-Experts with top-1 routing; scales to over a trillion parameters. |
| **ByT5** | 2021 | Xue et al., Google | Byte-level input/output Transformer. |
| **UL2** | 2022 | Tay et al., Google | Mixture-of-Denoisers pretraining that unifies several denoising objectives. |
| **Flan-T5 / Flan-PaLM** | 2022 | Chung et al., Google | Instruction tuning for better zero-shot generalization. Flan-PaLM is decoder-only; it is listed here because it shares a paper with Flan-T5. |

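The Mixture-of-Experts idea behind the Switch Transformer row can be illustrated with a toy router: each token is sent to the single expert with the highest router probability, and that expert's output is scaled by the probability. This is only a sketch under simplifying assumptions. The router logits are supplied by hand where a real model would compute them from the token with a learned matrix, the experts are hypothetical toy functions, and expert capacity limits and the load-balancing loss are omitted.

```python
import math


def switch_route(router_logits):
    """Pick the top-1 expert per token; return a list of (expert_index, gate_probability)."""
    assignments = []
    for logits in router_logits:
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        expert = max(range(len(probs)), key=probs.__getitem__)
        assignments.append((expert, probs[expert]))
    return assignments


def switch_layer(tokens, router_logits, experts):
    """Run only the selected expert on each token and scale its output by the gate value."""
    outputs = []
    for tok, (expert, gate) in zip(tokens, switch_route(router_logits)):
        outputs.append([gate * y for y in experts[expert](tok)])
    return outputs


# Hypothetical toy experts standing in for the per-expert feed-forward networks.
experts = [lambda v: [2.0 * x for x in v], lambda v: [-x for x in v]]
tokens = [[1.0, 2.0], [3.0, 4.0]]
router_logits = [[5.0, 0.0], [0.0, 5.0]]  # assumed given; normally computed from each token
routed = switch_route(router_logits)
outputs = switch_layer(tokens, router_logits, experts)
```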
---

## 🔹 Vision & Multimodal Transformers

| **Model** | **Year** | **Authors / Org** | **Key Idea** |
|-----------|----------|-------------------|--------------|
| **ViT (Vision Transformer)** | 2021 | Dosovitskiy et al., Google | First pure-Transformer vision model (*“An Image is Worth 16×16 Words”*). |
| **Swin Transformer** | 2021 | Liu et al., Microsoft | Hierarchical shifted-window design for scalable vision tasks. |
| **Perceiver / Perceiver IO** | 2021 | Jaegle et al., DeepMind | General, modality-agnostic attention bottleneck architecture. |
| **CLIP** | 2021 | Radford et al., OpenAI | Contrastive pretraining on paired image–text data. |
| **PaLM-E** | 2023 | Driess et al., Google Research | Embodied multimodal Transformer (vision + language + robotics). |

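The "16×16 words" in the ViT title refers to treating fixed-size image patches as tokens. The sketch below shows only that first step, splitting an image into flattened, non-overlapping patches, for a single-channel image stored as a list of rows. The function name `patchify` is chosen here for illustration; the linear patch embedding, class token, and position embeddings that follow in ViT are not shown. With this scheme a 224×224 image and 16×16 patches would yield 14 × 14 = 196 tokens.

```python
def patchify(image, patch_size):
    """Split a 2-D image (list of rows) into flattened, non-overlapping square patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten the patch row by row into one token vector.
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A 4x4 toy image with 2x2 patches gives four tokens of length four.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = patchify(image, 2)
```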
---

## ✅ Summary Families

- **Encoder-only:** BERT → RoBERTa → ALBERT → ELECTRA → DeBERTa.  
- **Decoder-only:** GPT → GPT-2 → GPT-3 → GPT-4.  
- **Encoder–Decoder:** T5 → Switch → ByT5 → UL2 → Flan-T5 / Flan-PaLM.  
- **Vision/Multimodal:** ViT → Swin → Perceiver → CLIP → PaLM-E.  
