
# **Architecture of the Transformer**

## **1. Overview**

The Transformer architecture, introduced in the paper *"Attention Is All You Need"* (Vaswani et al., 2017), is a neural network design that dispenses with recurrence and convolution and relies on attention instead, chiefly the **self-attention mechanism**.
It is widely used in **Natural Language Processing (NLP)**: BERT uses only the encoder stack, GPT only the decoder stack, and T5 the full encoder–decoder design.

---

## **2. Core Components**

A standard Transformer consists of **two main blocks**:

* **Encoder**: Processes the input sequence and creates contextual representations.
* **Decoder**: Generates the output sequence, attending to encoder outputs and previously generated tokens.

---

## **3. Detailed Architecture**

### **A. Encoder**

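Before the first layer, the **Input Embedding** is summed with a **Positional Encoding** so that each token vector carries information about its position. This step happens once, not in every layer.

Each of the N identical encoder layers then contains:

1. **Multi-Head Self-Attention Layer** – Computes relationships between all pairs of tokens in the input, with several attention heads running in parallel.
2. **Add & Norm** – Residual connection followed by layer normalization.
3. **Feed-Forward Neural Network (FFN)** – Two fully connected layers with a non-linear activation between them (ReLU in the original paper, GELU in many later models), applied to each position independently.
4. **Add & Norm** – Another residual connection and layer normalization.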
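
The encoder layer described above can be sketched in NumPy. This is a minimal illustration rather than a reference implementation: it handles a single unbatched sequence, omits dropout and the learned scale and shift of layer normalization, and uses small random weights. The function names, the `params` dictionary, and the sizes (`d_model=8`, `d_ff=32`, two heads) are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and (roughly) unit variance.
    # The learned scale and shift parameters are omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split(t):
        # (seq, d_model) -> (heads, seq, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    # Scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                   # (heads, seq, d_head)
    # Concatenate the heads and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

def encoder_layer(x, p, num_heads):
    # Sub-layer 1: multi-head self-attention, then Add & Norm.
    attn = multi_head_self_attention(x, p["w_q"], p["w_k"], p["w_v"], p["w_o"], num_heads)
    x = layer_norm(x + attn)
    # Sub-layer 2: position-wise FFN with ReLU, then Add & Norm.
    ffn = np.maximum(0.0, x @ p["w1"] + p["b1"]) @ p["w2"] + p["b2"]
    return layer_norm(x + ffn)

rng = np.random.default_rng(0)
d_model, d_ff, num_heads, seq_len = 8, 32, 2, 5  # illustrative sizes only
params = {
    "w_q": rng.normal(size=(d_model, d_model)) * 0.1,
    "w_k": rng.normal(size=(d_model, d_model)) * 0.1,
    "w_v": rng.normal(size=(d_model, d_model)) * 0.1,
    "w_o": rng.normal(size=(d_model, d_model)) * 0.1,
    "w1": rng.normal(size=(d_model, d_ff)) * 0.1, "b1": np.zeros(d_ff),
    "w2": rng.normal(size=(d_ff, d_model)) * 0.1, "b2": np.zeros(d_model),
}
x = rng.normal(size=(seq_len, d_model))  # stands in for embeddings + positions
out = encoder_layer(x, params, num_heads)
print(out.shape)  # (5, 8)
```

The output has the same shape as the input, which is what allows N such layers to be stacked one after another.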

---

### **B. Decoder**

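Before the first layer, the **Output Embedding** of the target tokens (shifted right by one position during training) is summed with a **Positional Encoding**. As in the encoder, this happens once rather than in every layer.

Each of the N identical decoder layers then contains:

1. **Masked Multi-Head Self-Attention** – Each position attends only to itself and earlier positions, so the model cannot peek ahead at future tokens.
2. **Add & Norm**
3. **Encoder–Decoder Attention** – Queries come from the decoder, while keys and values come from the encoder output, allowing the decoder to focus on relevant parts of the input.
4. **Add & Norm**
5. **Feed-Forward Neural Network**
6. **Add & Norm**

After the final decoder layer, a **Linear & Softmax Layer** projects each decoder output to vocabulary size and generates the probability distribution over the vocabulary.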
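
One common way to implement the masking in the decoder's self-attention is to set the attention scores for future positions to negative infinity before the softmax, so those positions receive zero weight. The sketch below assumes a single head and an unbatched score matrix; `causal_mask` and `masked_attention_weights` are hypothetical helper names used only for this example.

```python
import numpy as np

def causal_mask(seq_len):
    # True above the diagonal marks "future" positions that must be hidden.
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

def masked_attention_weights(scores):
    # Replace scores for future positions with -inf so that softmax
    # assigns them exactly zero weight.
    masked = np.where(causal_mask(scores.shape[-1]), -np.inf, scores)
    # The diagonal is never masked, so each row has a finite maximum.
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)
    return e / e.sum(axis=-1, keepdims=True)

# With all-zero scores, every visible position gets an equal share:
# row 0 attends only to position 0, row 1 splits evenly over 0 and 1, etc.
scores = np.zeros((4, 4))
weights = masked_attention_weights(scores)
print(np.round(weights, 2))
```

Because row `i` has non-zero weights only for columns `0..i`, the prediction at position `i` cannot depend on later tokens, which is what makes training on whole sequences in parallel consistent with step-by-step generation.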

---

## **4. Transformer Block Flow**

```plaintext
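Input Sequence                      Output Sequence (shifted right)
   ↓                                   ↓
Embedding + Positional Encoding     Embedding + Positional Encoding
   ↓                                   ↓
[Encoder Layers × N]  ──────────→  [Decoder Layers × N]
  (encoded representations)            ↓
                                    Linear + Softmax
                                       ↓
                                    Output Probabilities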
```
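
The last step of the flow, turning decoder outputs into a distribution over tokens, can be sketched as follows. The sizes, the random weights, and the name `output_distribution` are assumptions for illustration; greedy selection with `argmax` is just one possible decoding strategy.

```python
import numpy as np

def output_distribution(decoder_states, w_vocab, b_vocab):
    # Linear projection from d_model to vocabulary size, one row per position.
    logits = decoder_states @ w_vocab + b_vocab
    # Softmax over the vocabulary axis (max subtracted for stability).
    logits = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, vocab_size = 8, 20                # illustrative sizes only
states = rng.normal(size=(3, d_model))     # stands in for decoder outputs
w_vocab = rng.normal(size=(d_model, vocab_size))
b_vocab = np.zeros(vocab_size)

probs = output_distribution(states, w_vocab, b_vocab)
# Greedy decoding: pick the most probable token at the last position.
next_token = int(probs[-1].argmax())
print(probs.shape, next_token)
```

At inference time the chosen token is appended to the output sequence and the decoder is run again, one token per step.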

---

## **5. Key Features**

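* **No Recurrence** – All positions in a sequence are processed in parallel during training rather than one step at a time (generation at inference is still token by token).
* **Self-Attention** – Connects any two positions directly, which helps capture long-range dependencies; the cost grows quadratically with sequence length.
* **Positional Encoding** – Supplies the token-order information that attention alone does not have.
* **Scalability** – Parallel computation makes training large models on large datasets practical.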
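
As an illustration of the positional encoding feature, here is a minimal sketch of the fixed sinusoidal scheme from Vaswani et al. (2017), where even dimensions use a sine and odd dimensions a cosine of the position scaled by a dimension-dependent wavelength. Learned position embeddings are a common alternative; the function name and sizes here are assumptions for the example.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # Assumes d_model is even so sine/cosine dimensions pair up.
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # (1, d_model / 2)
    # Angle for dimension pair i is pos / 10000^(2i / d_model).
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=6, d_model=8)
embeddings = np.zeros((6, 8))   # placeholder for real token embeddings
x = embeddings + pe             # positions are added, not concatenated
print(pe.shape)  # (6, 8)
```

Each position receives a distinct vector, so identical tokens at different positions produce different inputs to the first layer.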
