### **Add and Norm in the Attention Architecture**

The **Add and Norm** step is a key component in the Transformer architecture, particularly in the **Encoder** and **Decoder** layers. It is designed to stabilize the learning process and facilitate the flow of gradients during backpropagation. Here's what it does:

---

### **1. Add: Residual Connection**
- The **Add** operation implements a **residual connection**, which helps mitigate the vanishing gradient problem and accelerates training. 
- The output of a sub-layer (e.g., Self-Attention or Feed-Forward Network) is **added** to the input of that sub-layer.
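
A minimal sketch of why the residual path helps gradients, assuming a toy `nn.Linear` as a stand-in sub-layer (it is not part of the original notebook). Its weights are zeroed on purpose so that the sub-layer itself passes no gradient back; any gradient that reaches the input must then come from the identity path.

```python
import torch
from torch import nn

x = torch.randn(2, 4, 8, requires_grad=True)  # (batch, sequence, embedding)

# Hypothetical stand-in sub-layer with zeroed weights: it blocks all gradient.
sublayer = nn.Linear(8, 8)
nn.init.zeros_(sublayer.weight)
nn.init.zeros_(sublayer.bias)

# Without the residual connection, no gradient reaches x.
(plain_grad,) = torch.autograd.grad(sublayer(x).sum(), x)

# With the residual connection, the identity path contributes a gradient of 1.
(residual_grad,) = torch.autograd.grad((x + sublayer(x)).sum(), x)

print(plain_grad.abs().max())                    # tensor(0.)
print(residual_grad.min(), residual_grad.max())  # tensor(1.) tensor(1.)
```

In a real network the sub-layer gradient is rarely exactly zero, but the same identity term is always added to it, which is what keeps the signal from vanishing in deep stacks.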

---

### **2. Norm: Layer Normalization**
- After the residual connection, **Layer Normalization** is applied to the sum, i.e. `LayerNorm(x + Sublayer(x))`, so that each token's activations have zero mean and unit variance before the learnable scale and shift are applied.
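
The two steps can be sketched as a small module. This is an illustrative sketch only: the class name `AddAndNorm` and the dimension `d_model` are assumptions, and PyTorch's built-in `nn.LayerNorm` stands in for the normalization.

```python
import torch
from torch import nn

class AddAndNorm(nn.Module):
    """Hypothetical helper: LayerNorm(x + sublayer_output)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer_output):
        # Add: residual connection. Norm: normalize over the embedding dimension.
        return self.norm(x + sublayer_output)

d_model = 8
x = torch.randn(2, 4, d_model)                # (batch, sequence, embedding)
sublayer_output = torch.randn(2, 4, d_model)  # e.g. attention or feed-forward output

add_norm = AddAndNorm(d_model)
y = add_norm(x, sublayer_output)
print(y.shape)         # same shape as x
print(y.mean(dim=-1))  # approximately 0 for every token
```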

### **Key Benefits of Add and Norm**
1. **Stabilized Training:**
   - Layer normalization ensures consistent scaling of activations across layers.
2. **Efficient Gradient Flow:**
   - Residual connections improve gradient flow during backpropagation, allowing deeper architectures to train effectively.
3. **Improved Representation Learning:**
   - Residual connections let the model focus on learning refinements over identity mappings.

---

### **Add and Norm in Transformer Layers**
In a Transformer encoder layer, **Add and Norm** is applied twice (a decoder layer applies it a third time, after its encoder-decoder attention sub-layer):
1. **After the Multi-Head Attention Sub-layer:**
   - Combines the attention output with the input embedding or previous layer's output.
2. **After the Feed-Forward Sub-layer:**
   - Combines the feed-forward output with the result from the previous Add and Norm step.
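
The two applications can be sketched as a simplified encoder layer. The class name `EncoderLayerSketch` and the sizes (`d_model=8`, `num_heads=2`, `d_ff=32`) are assumptions chosen for illustration, and dropout and masking are omitted.

```python
import torch
from torch import nn

class EncoderLayerSketch(nn.Module):
    """Illustrative Post-Norm encoder layer with two Add and Norm steps."""

    def __init__(self, d_model=8, num_heads=2, d_ff=32):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # 1. Add and Norm after multi-head self-attention
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # 2. Add and Norm after the feed-forward network
        x = self.norm2(x + self.ffn(x))
        return x

layer = EncoderLayerSketch()
x = torch.randn(3, 5, 8)  # (batch, sequence, embedding)
out = layer(x)
print(out.shape)          # torch.Size([3, 5, 8])
```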

### **Layer Normalization in Transformers**

**Layer Normalization (LayerNorm)** is a normalization technique used in Transformer models to stabilize and accelerate training. It operates on individual training examples (in a Transformer, on each token's feature vector) and normalizes those activations to have a mean of 0 and a standard deviation of 1. Here's a detailed breakdown:

---
### **How Does LayerNorm Work?**
Given a vector of activations \( x \) at a specific layer, LayerNorm computes:

$$
\text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta
$$
- **μ:** Mean of the activations over the normalized (feature) dimensions of a single example.
- **σ²:** Variance of the activations over the same dimensions.
- **ϵ:** A small constant that prevents division by zero.
- **γ, β:** Learnable parameters (scale and shift) that let the model undo the normalization where that is useful.
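
As a worked example of the formula, take the vector \( x = (0.2, 0.1, 0.3) \) with \( \epsilon = 10^{-5} \), \( \gamma = 1 \), and \( \beta = 0 \). These are the values of the first row used in the code cells of this notebook.

$$
\begin{aligned}
\mu &= \tfrac{1}{3}(0.2 + 0.1 + 0.3) = 0.2 \\
\sigma^2 &= \tfrac{1}{3}\left(0^2 + (-0.1)^2 + 0.1^2\right) \approx 0.006667 \\
\sqrt{\sigma^2 + \epsilon} &\approx \sqrt{0.006677} \approx 0.0817 \\
\text{LayerNorm}(x) &\approx (0,\ -1.2238,\ 1.2238)
\end{aligned}
$$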

---

### **LayerNorm in Transformers**
1. **Placement in the Architecture:**
   - In the Transformer, LayerNorm is applied either **before or after each sub-layer** (self-attention or feed-forward network).
   - In "Post-Norm" Transformers, which include the original architecture and the Add and Norm step described above, LayerNorm is applied after the residual addition. In "Pre-Norm" Transformers, it is applied to the sub-layer's input, and the residual is added afterwards.

2. **Benefits for Self-Attention:**
   - Self-attention outputs are weighted sums of value vectors, and their scale can vary widely across tokens and layers. LayerNorm rescales each token's vector to a consistent range before it is passed on.

3. **Contrast with BatchNorm:**
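   - Unlike Batch Normalization, which computes statistics over a batch of examples, LayerNorm computes them independently for each example. It therefore does not depend on batch size and behaves the same during training and inference, which makes it more suitable for variable-length sequential data like text.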
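
The two placements can be compared side by side. This is a sketch under assumptions: the functions `post_norm` and `pre_norm` are illustrative names, and an `nn.Linear` stands in for the attention or feed-forward sub-layer.

```python
import torch
from torch import nn

torch.manual_seed(0)
d_model = 8
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)  # stand-in for attention or feed-forward

def post_norm(x):
    # Post-Norm: normalize after the residual addition
    return norm(x + sublayer(x))

def pre_norm(x):
    # Pre-Norm: normalize the sub-layer input; the residual path stays untouched
    return x + sublayer(norm(x))

x = torch.randn(2, 4, d_model) * 5 + 3  # deliberately not zero-mean, unit-variance
post = post_norm(x)
pre = pre_norm(x)

print(post.mean(dim=-1))  # approximately 0: the layer output is normalized
print(pre.mean(dim=-1))   # generally not 0: x passes through the residual unnormalized
```

In the Pre-Norm variant the input `x` reaches the output without passing through any normalization, so the residual stream keeps its original scale.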

In [1]:
import torch
from torch import nn

In [2]:
inputs = torch.tensor([[[0.2, 0.1, 0.3], [0.5, 0.1, 0.1]]])  # shape [B=1, S=2, E=3]
B, S, E = inputs.size()
inputs = inputs.reshape(S, B, E)
inputs.size()

torch.Size([2, 1, 3])

The reshape changes the tensor's shape from **[B, S, E]** to **[S, B, E]**, the sequence-first layout that PyTorch's `nn.MultiheadAttention` and `nn.Transformer` modules expect by default (`batch_first=False`). Note that `reshape` only regroups the same elements in memory order; it does not swap axes. It gives the correct result here only because `B = 1`. For a batch with more than one example, use `inputs.permute(1, 0, 2)` (or `inputs.transpose(0, 1)`) so that each token stays attached to its own sequence.
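
The difference between `reshape` and an actual axis swap shows up as soon as the batch holds more than one example. The sketch below uses a small integer tensor, which is an assumption made so the element order is easy to follow.

```python
import torch

x = torch.arange(2 * 3 * 4).reshape(2, 3, 4)  # B=2, S=3, E=4
B, S, E = x.size()

reshaped = x.reshape(S, B, E)  # regroups elements in memory order
permuted = x.permute(1, 0, 2)  # swaps the batch and sequence axes

# After a true axis swap, position [s, b] holds token s of example b.
print(torch.equal(permuted[1, 0], x[0, 1]))  # True
print(torch.equal(reshaped, permuted))       # False: reshape mixes tokens across examples

# With a single example (B=1) the two operations happen to agree.
single = x[:1]
print(torch.equal(single.reshape(S, 1, E), single.permute(1, 0, 2)))  # True
```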

In [3]:
parameter_shape = inputs.size()[-2:]
gamma = nn.Parameter(torch.ones(parameter_shape))
beta =  nn.Parameter(torch.zeros(parameter_shape))

In [4]:
gamma.size(), beta.size()

(torch.Size([1, 3]), torch.Size([1, 3]))

In [5]:
dims = [-(i + 1) for i in range(len(parameter_shape))]

In [6]:
dims

[-1, -2]

In [7]:
mean = inputs.mean(dim=dims, keepdim=True)
mean.size()

torch.Size([2, 1, 1])

In [8]:
mean

tensor([[[0.2000]],

        [[0.2333]]])

In [9]:
var = ((inputs - mean) ** 2).mean(dim=dims, keepdim=True)
epsilon = 1e-5
std = (var + epsilon).sqrt()
std

tensor([[[0.0817]],

        [[0.1886]]])

In [10]:
y = (inputs - mean) / std
y

tensor([[[ 0.0000, -1.2238,  1.2238]],

        [[ 1.4140, -0.7070, -0.7070]]])

In [11]:
out = gamma * y + beta

In [12]:
out

tensor([[[ 0.0000, -1.2238,  1.2238]],

        [[ 1.4140, -0.7070, -0.7070]]], grad_fn=<AddBackward0>)
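
As a sanity check, the manual steps above can be compared against PyTorch's built-in `torch.nn.functional.layer_norm`. The sketch repeats the same computation in one place. Note that the walkthrough normalizes over the last two dimensions (batch and embedding), which equals per-token normalization here only because the batch size is 1.

```python
import torch
import torch.nn.functional as F

inputs = torch.tensor([[[0.2, 0.1, 0.3]], [[0.5, 0.1, 0.1]]])  # shape [S=2, B=1, E=3]

# Manual computation, mirroring the cells above
dims = [-1, -2]
mean = inputs.mean(dim=dims, keepdim=True)
var = ((inputs - mean) ** 2).mean(dim=dims, keepdim=True)
manual = (inputs - mean) / (var + 1e-5).sqrt()

# Built-in reference, normalizing over the same last two dimensions
reference = F.layer_norm(inputs, normalized_shape=(1, 3), eps=1e-5)

print(torch.allclose(manual, reference, atol=1e-6))  # True
```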

### **LayerNorm as a Class**

In [13]:
import torch
from torch import nn

class LayerNormalization(nn.Module):
    def __init__(self, parameters_shape, eps=1e-5):
        super().__init__()
        self.parameters_shape = parameters_shape  # trailing dims to normalize over
        self.eps = eps  # guards against division by zero
        self.gamma = nn.Parameter(torch.ones(parameters_shape))  # learnable scale
        self.beta = nn.Parameter(torch.zeros(parameters_shape))  # learnable shift

    def forward(self, inputs):
        # Normalize over the last len(parameters_shape) dimensions
        dims = [-(i + 1) for i in range(len(self.parameters_shape))]
        mean = inputs.mean(dim=dims, keepdim=True)
        print(f"Mean \n ({mean.size()}): \n {mean}")
        var = ((inputs - mean) ** 2).mean(dim=dims, keepdim=True)
        std = (var + self.eps).sqrt()
        print(f"Standard Deviation \n ({std.size()}): \n {std}")
        y = (inputs - mean) / std
        print(f"y \n ({y.size()}) = \n {y}")
        out = self.gamma * y + self.beta
        print(f"out \n ({out.size()}) = \n {out}")
        return out

In [14]:
batch_size = 3
sentence_length = 5
embedding_dim = 8 
inputs = torch.randn(sentence_length, batch_size, embedding_dim)

print(f"input \n ({inputs.size()}) = \n {inputs}")

input 
 (torch.Size([5, 3, 8])) = 
 tensor([[[ 8.3799e-01,  1.5723e+00, -9.5514e-01, -1.4994e+00,  7.0361e-01,
           1.0687e+00,  2.3581e+00,  1.9513e+00],
         [-4.6570e-01, -6.9836e-01, -6.4936e-01, -5.9919e-02,  6.3958e-01,
           2.6308e-01,  3.1458e-01, -1.7177e-01],
         [-1.5075e-01,  3.0940e-01,  1.3171e-01, -1.1500e+00,  5.6917e-01,
           1.1410e+00,  8.0574e-04, -6.1830e-01]],

        [[-3.3019e-01, -4.3450e-01,  3.8669e-02,  5.5271e-01, -5.4959e-01,
          -1.2138e+00,  1.6253e+00,  5.1907e-01],
         [ 1.6160e+00,  1.1097e-01,  7.0581e-01, -1.5954e+00, -6.9436e-01,
          -1.4114e+00, -1.5395e+00, -9.0551e-01],
         [ 1.2125e-01,  3.6788e-01,  4.3097e-01,  1.8649e-01, -1.4047e+00,
          -8.2758e-01, -4.0525e-01, -3.6059e-01]],

        [[-2.6711e+00,  9.5854e-01, -1.6292e+00,  5.0002e-02,  4.1646e-01,
           5.0564e-01, -7.7049e-01, -1.5622e+00],
         [-1.0538e+00,  1.2483e+00, -1.0638e-01, -4.7291e-01,  5.2497e-01,
          

In [15]:
layer_norm = LayerNormalization(inputs.size()[-1:])

In [16]:
out = layer_norm.forward(inputs)

Mean 
 (torch.Size([5, 3, 1])): 
 tensor([[[ 0.7547],
         [-0.1035],
         [ 0.0291]],

        [[ 0.0260],
         [-0.4642],
         [-0.2364]],

        [[-0.5878],
         [ 0.0503],
         [ 0.4018]],

        [[-0.2384],
         [ 0.0698],
         [ 0.3080]],

        [[-0.3973],
         [ 0.4691],
         [-0.2058]]])
Standard Deviation 
 (torch.Size([5, 3, 1])): 
 tensor([[[1.2641],
         [0.4543],
         [0.6576]],

        [[0.8129],
         [1.0959],
         [0.5996]],

        [[1.1942],
         [0.7262],
         [0.9149]],

        [[1.1428],
         [1.0783],
         [0.6380]],

        [[1.0892],
         [0.8886],
         [0.6554]]])
y 
 (torch.Size([5, 3, 8])) = 
 tensor([[[ 0.0659,  0.6468, -1.3526, -1.7831, -0.0404,  0.2484,  1.2684,
           0.9466],
         [-0.7973, -1.3095, -1.2016,  0.0959,  1.6357,  0.8069,  0.9203,
          -0.1503],
         [-0.2735,  0.4262,  0.1560, -1.7930,  0.8212,  1.6908, -0.0431,
          -0.9845]],



In [17]:
out[0].mean(), out[0].std()

(tensor(-1.4901e-08, grad_fn=<MeanBackward0>),
 tensor(1.0215, grad_fn=<StdBackward0>))
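
The standard deviation of about 1.0215 rather than exactly 1 is consistent with `torch.std` applying Bessel's correction by default. `out[0]` pools 3 × 8 = 24 values, each row normalized with the population variance, so the sample estimate comes out near sqrt(24/23) ≈ 1.0215. The sketch below reproduces this with `F.layer_norm` and a fixed seed, both assumptions made for reproducibility. The exact values differ from the run above because the random inputs differ.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
inputs = torch.randn(5, 3, 8)                     # [S, B, E]
out = F.layer_norm(inputs, normalized_shape=(8,))

block = out[0]                    # 3 token vectors x 8 features = 24 values
print(block.std())                # sample std (divides by N - 1), about 1.0215
print(block.std(unbiased=False))  # population std (divides by N), about 1.0
print((24 / 23) ** 0.5)           # correction factor sqrt(N / (N - 1))
```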