### Seq2Seq = Framework

Seq2Seq (Sequence-to-Sequence) is a high-level architecture, not a specific model. It’s defined by:

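- **Encoder**: processes the input sequence into an intermediate representation (a single fixed-size context vector in the classic formulation; one state per input position when attention is used).
- **Decoder**: generates the output sequence from that representation, one token at a time.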
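
The encoder/decoder split can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not a trained model: the weights are random, the decoder's hidden state doubles as its output, and the helper names (`rand_matrix`, `rnn_step`, `encode`, `decode`) are made up for this example. The point is the control flow: a variable-length input is folded into one fixed-size vector, and the output length is chosen independently of the input length.

```python
import math
import random

def rand_matrix(rows, cols, rng):
    """Random weights; a real model would learn these."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def rnn_step(W_x, W_h, x, h):
    """Vanilla RNN update: h_new = tanh(W_x @ x + W_h @ h)."""
    return [
        math.tanh(
            sum(w * v for w, v in zip(W_x[i], x))
            + sum(w * v for w, v in zip(W_h[i], h))
        )
        for i in range(len(W_h))
    ]

def encode(W_x, W_h, inputs):
    """Fold a variable-length input into one fixed-size vector."""
    h = [0.0] * len(W_h)
    for x in inputs:
        h = rnn_step(W_x, W_h, x, h)
    return h

def decode(W_x, W_h, context, start_vec, steps):
    """Unroll the decoder from the context, feeding each output back in."""
    h, y, outputs = context, start_vec, []
    for _ in range(steps):
        h = rnn_step(W_x, W_h, y, h)
        y = h  # toy choice: the hidden state doubles as the output vector
        outputs.append(y)
    return outputs

rng = random.Random(0)
HIDDEN, EMBED = 4, 3
enc_Wx, enc_Wh = rand_matrix(HIDDEN, EMBED, rng), rand_matrix(HIDDEN, HIDDEN, rng)
dec_Wx, dec_Wh = rand_matrix(HIDDEN, HIDDEN, rng), rand_matrix(HIDDEN, HIDDEN, rng)

source = [[rng.random() for _ in range(EMBED)] for _ in range(7)]   # 7 input steps
context = encode(enc_Wx, enc_Wh, source)                            # always HIDDEN values
outputs = decode(dec_Wx, dec_Wh, context, [0.0] * HIDDEN, steps=5)  # 5 output steps
```

A real decoder would project each hidden state to vocabulary logits and stop at an end-of-sequence token rather than after a fixed number of steps.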

**What powers the encoder/decoder in Seq2Seq?**

They can be built using different kinds of layers:

1. RNN-based Seq2Seq (RNN encoder and RNN decoder)
2. LSTM-based Seq2Seq
3. GRU-based Seq2Seq
4. Transformer-based Seq2Seq

All these are building blocks for the encoder and decoder in Seq2Seq models.

**Umbrella view of Seq2Seq and related architectures**

---

### 1. **Core Concepts**

* **Sequence-to-Sequence (Seq2Seq) Architecture**

  * Encoder
  * Decoder
  * Context vector (or intermediate representation)
  * Teacher Forcing
  * Auto-regressive decoding
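
Teacher forcing and auto-regressive decoding differ only in what the decoder receives as its previous token. The sketch below makes that difference concrete. It assumes a hypothetical `step_fn` that maps the previous token to the next one (here a lookup table standing in for a trained decoder) and uses `<s>` / `</s>` as illustrative start and end markers.

```python
def decoder_inputs_teacher_forcing(target, bos):
    """Training: the input at step t is the gold token from step t-1."""
    return [bos] + target[:-1]

def greedy_decode(step_fn, bos, eos, max_len):
    """Inference: the input at step t is the model's own output from step t-1."""
    token, out = bos, []
    for _ in range(max_len):
        token = step_fn(token)
        if token == eos:
            break
        out.append(token)
    return out

# Stand-in for a trained decoder: a fixed previous-token -> next-token table.
toy_next = {"<s>": "le", "le": "chat", "chat": "</s>"}

target = ["le", "chat", "</s>"]
train_inputs = decoder_inputs_teacher_forcing(target, "<s>")
generated = greedy_decode(toy_next.get, "<s>", "</s>", max_len=10)
```

With teacher forcing, an early mistake never reaches later steps during training; at inference the model has to consume its own predictions, which is why the two loops are written separately.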

---

### 2. **Recurrent-Based Models**

Used as early Seq2Seq encoders and decoders.

#### Basic Units

* RNN (Recurrent Neural Network)
    * Vanilla RNN - with recurrent connections
    * Bidirectional RNN (BiRNN)
    * Stacked RNN
    * Residual RNN
* LSTM (Long Short-Term Memory)
    * Vanilla LSTM
        - standard LSTM cell with forget, input, and output gates and a cell state
    * Peephole LSTM
        - adds "peephole" connections from the cell state to the gates, so the gates can read the cell state directly (helps on tasks that need precise timing)
    * Coupled Gate LSTM
        - simplifies LSTM by coupling input and forget gates
    * BiLSTM
        - processes the input in both directions (left-to-right and right-to-left); in Seq2Seq it is used only in the encoder, because the decoder generates left to right
* GRU (Gated Recurrent Unit)
    * Vanilla GRU
    * BiGRU
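
To make the LSTM variants above concrete, here is a minimal sketch of one cell update with hidden size 1, so every quantity is a scalar. The parameter layout (`p` mapping gate names to `(w_x, w_h, b)`) and the `coupled` flag are assumptions made for this example; peephole connections are left out to keep it short.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p, coupled=False):
    """One step of a scalar LSTM cell.

    p maps "forget", "input", "output", "candidate" to (w_x, w_h, b).
    With coupled=True the input gate is tied to the forget gate (i = 1 - f).
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))                           # how much old cell state to keep
    i = 1.0 - f if coupled else sigmoid(pre("input"))    # how much new content to write
    o = sigmoid(pre("output"))                           # how much cell state to expose
    g = math.tanh(pre("candidate"))                      # proposed new content
    c = f * c_prev + i * g                               # updated cell state
    h = o * math.tanh(c)                                 # updated hidden state
    return h, c

# Forget gate saturated open, input gate saturated shut: the cell keeps its memory.
params = {
    "forget": (0.0, 0.0, 20.0),
    "input": (0.0, 0.0, -20.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, p=params)
```

The additive update `c = f * c_prev + i * g` is what lets the cell state carry information across many steps; the coupled variant drops the separate input-gate parameters at the cost of tying "forget" and "write" together.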


#### Seq2Seq Encoder-Decoder Recurrent Variants

* RNN Encoder → RNN Decoder
* LSTM Encoder → LSTM Decoder
* GRU Encoder → GRU Decoder
* Bi-directional RNN/LSTM/GRU Encoder
* Attention-enhanced RNNs

---

### 3. **Attention Mechanisms**

Addresses the bottleneck of the fixed-size context vector in classic Seq2Seq by letting the decoder consult every encoder state instead of a single summary vector.

* General Attention (encoder-decoder attention)
    - Weights the encoder hidden states by their relevance to the current decoder state
        - Bahdanau (Additive) Attention
        - Luong (Multiplicative) Attention
* Self-Attention
    - Compares every token with every other token in the sequence (used in Transformers)
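
The weighting step described above can be sketched with the simplest scoring function, a plain dot product (the "dot" score from Luong-style multiplicative attention). Additive (Bahdanau) attention would replace the dot product with a small feed-forward network. The function names and the toy vectors are assumptions for illustration only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_attention(decoder_state, encoder_states):
    """Score each encoder state against the decoder state, then average them."""
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * h[k] for w, h in zip(weights, encoder_states)) for k in range(dim)
    ]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one vector per source token
weights, context = dot_attention([0.0, 2.0], encoder_states)
```

The context vector is recomputed at every decoder step, so it changes with the decoder state. Self-attention applies the same computation with queries, keys, and values all drawn from one sequence.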

---




### 4. **Transformer-Based Architectures**

Modern alternative to RNNs. They use self-attention instead of recurrence, so training parallelizes across sequence positions (autoregressive generation still proceeds token by token). Backbone of BERT, GPT, and T5.

* Transformer (Encoder-Decoder)
* Multi-Head Self-Attention
* Positional Encoding
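
Because self-attention has no built-in notion of order, position information is added to the token embeddings. The sketch below shows the fixed sinusoidal scheme from the original Transformer; it is one option among several (BERT and GPT, for example, learn their position embeddings instead). The function name is chosen for this example.

```python
import math

def sinusoidal_positional_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of sinusoidal position vectors."""
    table = []
    for pos in range(num_positions):
        row = []
        for k in range(d_model):
            # Dimensions 2i and 2i+1 share one frequency: 1 / 10000^(2i / d_model).
            angle = pos / (10000 ** ((k - k % 2) / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_positional_encoding(num_positions=4, d_model=4)
```

Each row is added element-wise to the embedding of the token at that position before the first attention layer.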

**Encoder-Only**
- BERT
- RoBERTa

**Decoder-Only**

- GPT, GPT-2, GPT-3, GPT-4

**Encoder-Decoder**

- T5, BART, MarianMT

---

### 5. **Language Models & Variants**
Assign probabilities to token sequences; used to predict the next (or a masked) token and to generate text.

* Traditional LMs
    - N-gram Models

* Neural LMs
    - RNN-LM
    - LSTM-LM
    - GRU-LM
    - Transformer LMs (e.g., GPT)

* Types by Behavior
    - Autoregressive: GPT (Generative Pretrained Transformer), RNN-LM
    - Masked: BERT (Bidirectional Encoder Representations from Transformers)
    - Seq2Seq-style: T5, BART
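
The autoregressive and masked objectives differ in which context the model may see when predicting a token. The sketch below builds training examples for both from the same sentence. It is a simplification: the function names are invented here, and real masked-LM training hides a random subset of positions rather than one hand-picked index.

```python
def autoregressive_examples(tokens):
    """(context, target) pairs: predict each token from everything to its left."""
    return [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]

def masked_example(tokens, position, mask_token="[MASK]"):
    """Hide one token; the model sees context on both sides of the gap."""
    corrupted = tokens[:position] + [mask_token] + tokens[position + 1:]
    return corrupted, tokens[position]

tokens = ["the", "cat", "sat"]
ar_pairs = autoregressive_examples(tokens)
mlm_input, mlm_target = masked_example(tokens, position=1)
```

Seq2Seq-style models combine the two views: an encoder reads the (possibly corrupted) input with full context, and a decoder generates the output left to right.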

---

### 6. **Autoencoders**

Encode the input into a compact representation and reconstruct the input from it. Trained without labels to learn representations, not necessarily to generate sequences, though they can be extended to sequences. Used in both vision and language.

* Vanilla Autoencoders (dense)
* Sequence Autoencoders
* Variational Autoencoders (VAEs)
* Denoising Autoencoders

---

### 7. **Specialized or Extended Architectures**

For advanced or niche tasks (e.g., ASR, memory handling).

* Pointer Networks
* Copy Mechanisms
* Memory Networks
* Recursive Neural Networks (TreeRNNs)
* RNN-Transducer (RNN-T; used in automatic speech recognition, ASR)
* Hybrid CNN-RNN models

---



```
Seq2Seq
├── Encoder-Decoder
│   ├── RNN
│   ├── LSTM
│   │   ├── Vanilla
│   │   ├── Coupled Gate
│   │   ├── Peephole
│   │   └── BiLSTM
│   ├── GRU
│   └── Attention
│       ├── General (Bahdanau, Luong)
│       └── Self-Attention
├── Transformer (Attention-based)
│   ├── BERT (Encoder)
│   ├── GPT (Decoder)
│   └── T5, BART (Encoder-Decoder)
└── Related
    ├── Autoencoders
    └── Variational Autoencoders
```