
---

### ✨ Why Normalization Helps in Deep Learning

In gradient-based learning models, maintaining **stable input distributions** is critical for efficient and reliable training. If the inputs to a layer are not well-scaled — i.e., if their magnitudes are too large (exploding) or too small (vanishing) — it can lead to **unstable gradients** during backpropagation. This causes weight updates to become erratic or ineffective, slowing down or even preventing convergence.

During training, the parameters of earlier layers keep changing, so the distribution of activations that each later layer receives keeps shifting as well, a phenomenon Ioffe and Szegedy (2015) called **internal covariate shift**. Non-linearities such as ReLU or tanh can amplify the effect. This shift alters the input statistics seen by each layer over time, making the optimization landscape more volatile and harder to navigate.

**Normalization addresses this problem** by ensuring that the inputs to a layer maintain a consistent scale (typically with zero mean and unit variance). By normalizing the inputs, either across the batch (BatchNorm), across the features of each sample (LayerNorm), or across groups of channels within each sample (GroupNorm), we help:

- **Reduce the risk of exploding or vanishing gradients**
- **Stabilize training** and make it more predictable
- **Accelerate convergence** by allowing higher learning rates
- **Improve generalization**, sometimes even acting like a regularizer

The core idea is simple:

> ⚙️ **Keep the inputs to each layer well-behaved**, because the output of one layer becomes the input to the next.

By maintaining a stable scale of activations throughout the network, normalization allows deep models to train efficiently and generalize better.
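
To make the scale argument concrete, here is a minimal sketch that pushes one random vector through a stack of random linear layers, once without and once with per-layer standardization. The helper names (`run_stack`, `standardize`), the depth, the width, and the weight scale are all illustrative assumptions, not taken from any particular model.

```python
import math
import random

def linear(x, weights):
    """Apply a dense layer (no bias): y_j = sum_i W[j][i] * x[i]."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def standardize(x, eps=1e-5):
    """Rescale a vector to zero mean and (approximately) unit variance."""
    mu = sum(x) / len(x)
    var = sum((xi - mu) ** 2 for xi in x) / len(x)
    return [(xi - mu) / math.sqrt(var + eps) for xi in x]

def rms(x):
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(xi * xi for xi in x) / len(x))

def run_stack(depth, width, weight_std, normalize, seed=0):
    """Push one random vector through `depth` random linear layers."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        weights = [[rng.gauss(0.0, weight_std) for _ in range(width)]
                   for _ in range(width)]
        x = linear(x, weights)
        if normalize:
            x = standardize(x)
    return rms(x)

raw_scale = run_stack(depth=20, width=16, weight_std=1.0, normalize=False)
normed_scale = run_stack(depth=20, width=16, weight_std=1.0, normalize=True)
print(f"without normalization: {raw_scale:.3e}")
print(f"with normalization:    {normed_scale:.3e}")
```

With these (deliberately badly scaled) weights the unnormalized activations grow by orders of magnitude, while the standardized ones stay near 1 at every depth.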

---




### 🧠 **Batch Normalization (BatchNorm)**

BatchNorm **normalizes across the batch**, and it typically operates **per feature/channel** (not across features like LayerNorm).

---

### 🔍 Example: Let's say you have an input of shape:

For NLP (e.g., transformer input):
```
(batch_size, seq_len, hidden_dim)
```

For CNNs (e.g., images):
```
(batch_size, channels, height, width)
```

---

### ✅ BatchNorm Steps (Training Phase):

For **each feature dimension** `d` (in NLP) or **each channel** (in CNN), BatchNorm computes:

$$
\mu_d = \frac{1}{N} \sum_{i=1}^{N} x_i^d, \quad \sigma_d = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} (x_i^d - \mu_d)^2 + \epsilon }
$$

- Here, \( N \) is the number of values pooled for each feature: `batch_size` for plain feature vectors, `batch_size × seq_len` for the NLP layout above, and `batch_size × height × width` for CNNs
- \( x_i^d \) is the value of feature `d` for the `i`th data point
- This means: for each feature/channel `d`, normalize using the **mean and std of that feature across the batch**
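
To make \( N \) concrete for the CNN layout, the sketch below pools every value of one channel across batch, height, and width. It uses plain nested lists shaped `(batch_size, channels, height, width)`; the function name `channel_stats` and the toy numbers are assumptions for illustration only.

```python
def channel_stats(x):
    """Per-channel (N, mean, variance) for a nested list shaped
    (batch_size, channels, height, width)."""
    num_channels = len(x[0])
    stats = []
    for c in range(num_channels):
        # Pool every value of channel c across batch, height and width.
        values = [v for sample in x for row in sample[c] for v in row]
        n = len(values)  # batch_size * height * width
        mean = sum(values) / n
        var = sum((v - mean) ** 2 for v in values) / n
        stats.append((n, mean, var))
    return stats

# Toy input: batch_size=2, channels=2, height=2, width=2.
x = [
    [[[1.0, 2.0], [3.0, 4.0]], [[10.0, 10.0], [10.0, 10.0]]],
    [[[5.0, 6.0], [7.0, 8.0]], [[20.0, 20.0], [20.0, 20.0]]],
]
for c, (n, mean, var) in enumerate(channel_stats(x)):
    print(f"channel {c}: N={n}, mean={mean}, var={var}")
```

Each channel gets a single mean and variance, so here \( N = 2 \cdot 2 \cdot 2 = 8 \) even though the batch holds only two images.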

$$
\text{BatchNorm}(x^d) = \gamma_d \cdot \frac{x^d - \mu_d}{\sigma_d} + \beta_d
$$

> **Important:** During training, these batch-wise statistics are also used to update **running estimates** of the mean and variance for each feature using exponential moving averages.
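
The training-phase steps above can be sketched for the simplest case, a `(batch_size, num_features)` input. This is a minimal pure-Python sketch, not a framework implementation; `batch_norm_train` and the toy batch are hypothetical names and values.

```python
import math

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Training-mode BatchNorm for a (batch_size, num_features) nested list.

    Returns the normalized batch plus the per-feature batch mean and variance.
    """
    n = len(x)
    num_features = len(x[0])
    means, variances = [], []
    for d in range(num_features):
        column = [row[d] for row in x]                  # feature d across the batch
        mu = sum(column) / n
        var = sum((v - mu) ** 2 for v in column) / n    # biased (1/N) variance
        means.append(mu)
        variances.append(var)
    out = [
        [gamma[d] * (row[d] - means[d]) / math.sqrt(variances[d] + eps) + beta[d]
         for d in range(num_features)]
        for row in x
    ]
    return out, means, variances

# Two features on very different scales.
batch = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0], [4.0, 700.0]]
out, mu, var = batch_norm_train(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
for row in out:
    print([round(v, 3) for v in row])
```

After normalization both columns share the same scale, even though the raw features differ by two orders of magnitude.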

---

### 🧪 BatchNorm at **Inference Time**

During inference, we **do not compute statistics from the incoming batch**, because:
- The batch size may be small (even size 1)
- We want consistent behavior

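Instead, we use the **running estimates** of mean and variance accumulated during training. After each training batch, the layer updates them with an exponential moving average:

$$
\text{mean}_{\text{new}} = (1 - \alpha) \cdot \text{mean}_{\text{old}} + \alpha \cdot \text{mean}_{\text{batch}}
$$

$$
\text{var}_{\text{new}} = (1 - \alpha) \cdot \text{var}_{\text{old}} + \alpha \cdot \text{var}_{\text{batch}}
$$

At inference, the stored values replace the batch statistics in the normalization formula:

$$
\text{BatchNorm}_{\text{infer}}(x^d) = \gamma_d \cdot \frac{x^d - \text{mean}^d_{\text{running}}}{\sqrt{\text{var}^d_{\text{running}} + \epsilon}} + \beta_d
$$

Where:
- \( \alpha \) is the momentum of the moving average (a small constant, for example 0.1)
- \( \text{mean}_{\text{new}} \) and \( \text{var}_{\text{new}} \) are the **running mean and variance** tracked during training
- These are stored in the layer and **frozen** during inference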
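
A minimal sketch of the two phases for a single feature: `update` applies the moving-average equations during training, and `infer` normalizes with the frozen running values. The class name `RunningBatchNorm`, the initial values (0 and 1), and the momentum of 0.1 are assumptions chosen to resemble common framework defaults.

```python
import math

class RunningBatchNorm:
    """Tracks running statistics for one feature and applies them at inference."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum      # alpha in the update equations
        self.eps = eps
        self.running_mean = 0.0
        self.running_var = 1.0
        self.gamma, self.beta = 1.0, 0.0

    def update(self, batch):
        """Training step: fold the batch statistics into the running estimates."""
        mean = sum(batch) / len(batch)
        var = sum((v - mean) ** 2 for v in batch) / len(batch)
        a = self.momentum
        self.running_mean = (1 - a) * self.running_mean + a * mean
        self.running_var = (1 - a) * self.running_var + a * var

    def infer(self, x):
        """Inference: normalize one value with the frozen running statistics."""
        x_hat = (x - self.running_mean) / math.sqrt(self.running_var + self.eps)
        return self.gamma * x_hat + self.beta

bn = RunningBatchNorm()
for _ in range(200):                  # the same toy batch at every training step
    bn.update([8.0, 10.0, 12.0])

print(bn.running_mean, bn.running_var)
print(bn.infer(10.0), bn.infer(12.0))  # works for single values, no batch needed
```

Because `infer` never looks at other inputs, a batch of size 1 gives the same result as any larger batch.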

---

### 🚨 Key Differences Between BatchNorm and LayerNorm

| Property              | **BatchNorm**                                                  | **LayerNorm**                                      |
|----------------------|----------------------------------------------------------------|---------------------------------------------------|
| Normalizes over      | **Batch dimension** (per feature/channel)                      | **Feature dimension** (per sample)                |
| Statistics used       | Batch-wise mean/variance (training) or running stats (inference) | Sample-wise mean/variance (always)              |
| Sensitive to batch size? | ✅ Yes — small batch sizes can cause instability           | ❌ No — works even with batch size 1              |
| Used in              | CNNs and other vision models with reasonably large batches     | Transformers, RNNs, most modern NLP models        |
| Behavior at inference| Uses **stored running averages** of mean/variance              | Recomputes from current input each time           |
| Learnable params     | γ and β per feature/channel                                    | γ and β per feature                               |

---


## Layer Normalization

In **Layer Normalization**, we **do *not*** normalize across the *sequence/time steps*. Instead, **we normalize across the features/dimensions within a single input vector**, *for each element in the batch independently*.

### Let's break it down properly:

Suppose your input tensor is shaped like:

```
(batch_size, seq_len, hidden_dim)
```

LayerNorm is applied **at each time step independently**, and **across the `hidden_dim`** — not across sequence or batch.

So for a single vector:
```
x = [x₁, x₂, ..., x_d]  ∈ ℝ^d   (this is one hidden vector, e.g. for one time step)
```

LayerNorm computes:

$$
\mu = \frac{1}{d} \sum_{i=1}^{d} x_i, \quad \sigma = \sqrt{ \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2 + \epsilon }
$$

Then the normalized output is:

$$
\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sigma} + \beta
$$

- `γ` (gamma) and `β` (beta) are **learnable parameters**, one per feature dimension (`d` total)
- `⊙` is element-wise multiplication

In short:

> In LayerNorm, we take **each individual input vector** (e.g. per time step), and normalize **across its features** using the formula:
>
> $$
> \text{output} = \gamma \odot \frac{x - \mu}{\sigma} + \beta
> $$
>
> This ensures that each vector has zero mean and unit variance **before** the affine step, and the learnable `γ` and `β` then allow the network to re-scale and shift as needed.

- LayerNorm needs no running mean or variance: its statistics come from the features of the current input vector, so it behaves identically during training and inference.
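
The LayerNorm formula can be sketched directly on a `(batch_size, seq_len, hidden_dim)` nested list. This is a minimal pure-Python sketch; `layer_norm`, `layer_norm_sequence`, and the toy values are hypothetical, and `γ`/`β` are fixed to 1 and 0 for readability.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm for one hidden vector x of length d."""
    d = len(x)
    mu = sum(x) / d
    var = sum((xi - mu) ** 2 for xi in x) / d
    sigma = math.sqrt(var + eps)
    return [g * (xi - mu) / sigma + b for xi, g, b in zip(x, gamma, beta)]

def layer_norm_sequence(batch, gamma, beta):
    """Apply LayerNorm to every (sample, time step) vector independently.

    `batch` is a nested list shaped (batch_size, seq_len, hidden_dim).
    """
    return [[layer_norm(vec, gamma, beta) for vec in seq] for seq in batch]

hidden_dim = 4
gamma, beta = [1.0] * hidden_dim, [0.0] * hidden_dim
batch = [
    [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]],   # sample 0, two time steps
    [[-5.0, 0.0, 5.0, 100.0], [0.5, 0.5, 0.5, 1.5]],    # sample 1, two time steps
]
out = layer_norm_sequence(batch, gamma, beta)
for seq in out:
    for vec in seq:
        print([round(v, 3) for v in vec])
```

Running the function on `[batch[0]]` alone gives the same vectors as for sample 0 in the full batch, which is the batch-size independence described above.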

## Question
---

> In the case of Batch Normalization, the idea makes statistical sense — we compute the mean and variance over a mini-batch for each feature dimension, effectively treating the batch as a sample to prevent scale instability across features. This aligns well with the statistical principle that normalization should be based on a group of observations.
>
> However, in Layer Normalization, we compute the mean and variance across all feature dimensions of a single input vector (e.g., a 512-dimensional embedding), without aggregating across multiple samples. From a statistical standpoint, this seems less intuitive, since we're not sampling over a population or batch. Doesn't this approach violate the principle that normalization should rely on a sample of values to estimate meaningful statistics?
>

---

### Why “normalize across the *features* of one vector” actually helps
LayerNorm was introduced (Ba et al., 2016) after people tried BatchNorm inside RNNs and discovered two practical road-blocks:

| Issue with Batch Statistics | What happens in RNNs/Transformers |
|---|---|
| **Statistics depend on other samples.**  A token at position *t* sees a different mean/var when the mini-batch composition changes. | The hidden state of token *t* is repeatedly reused at every time-step / attention block; if its scale keeps drifting, gradients explode or vanish. |
| **Small or varying batch sizes are common.**  BatchNorm gets noisy or unusable when `batch_size≈1`. | Autoregressive decoding, on-device inference, curriculum learning, very long sequences → batches can be tiny, even size 1. |

LayerNorm side-steps both problems by **treating each hidden vector as its own “mini-batch”.**

---

#### Intuition ① – keep each vector on a well-behaved manifold
Every hidden vector **h ∈ ℝ<sup>d</sup>** (e.g. `d = 512`) is pushed onto a *learnable hyper-ellipsoid*:

$$
\tilde h = \gamma \odot \frac{h-\mu}{\sigma} + \beta ,
\qquad 
\mu=\tfrac1d\sum_{i=1}^{d}h_i ,
\quad 
\sigma=\sqrt{\tfrac1d\sum_{i=1}^{d}(h_i-\mu)^2+\varepsilon}.
$$

- **Zero-centre & unit-variance** ⇒ dot-products, softmax logits, gating sigmoids, etc. all see inputs of comparable scale.  
- **γ, β** immediately restore any scale/shift the next layer *wants*, so we don’t lose representation power; we only stop *uncontrolled drift*.

Think of it as replacing “raw space” with **a coordinate system whose axes have the same statistical weight at every step.**
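
One way to see the geometry: with `ε = 0`, the standardized vector always has Euclidean length `√d`, whatever the scale of the input. The sketch below checks this numerically; `standardize` and the chosen input scales are illustrative assumptions.

```python
import math
import random

def standardize(h, eps=0.0):
    """The (h - mu) / sigma step of LayerNorm, before gamma and beta."""
    d = len(h)
    mu = sum(h) / d
    var = sum((v - mu) ** 2 for v in h) / d
    return [(v - mu) / math.sqrt(var + eps) for v in h]

rng = random.Random(0)
d = 512
for scale in (0.01, 1.0, 1000.0):
    h = [rng.gauss(3.0, scale) for _ in range(d)]
    h_hat = standardize(h)
    length = math.sqrt(sum(v * v for v in h_hat))
    print(f"input scale {scale:>7}: ||h_hat|| = {length:.4f}, sqrt(d) = {math.sqrt(d):.4f}")
```

`γ` and `β` then stretch and shift this fixed-radius set, which is where the "learnable hyper-ellipsoid" picture comes from.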

---

#### Intuition ② – self-attention is a massive dot-product machine  
In transformers the core operation is  

$$
\text{Attention}(Q,K,V)=\text{softmax}\!\Bigl(\tfrac{QK^\top}{\sqrt{d_k}}\Bigr)V .
$$

If individual token vectors explode, the *scale* of \(QK^\top\) explodes ⇒ vanishing gradients through the softmax.  
LayerNorm keeps each vector length roughly the same so that the \(\tfrac1{\sqrt{d_k}}\) factor truly stabilises things.
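
The saturation effect is easy to reproduce: scaling the same logits up makes the softmax nearly one-hot, and the diagonal of its Jacobian, `p_i (1 - p_i)`, collapses toward zero. The logits and scale factors below are made-up values for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def max_softmax_grad(logits):
    """Largest entry of the softmax Jacobian diagonal, p_i * (1 - p_i)."""
    return max(p * (1.0 - p) for p in softmax(logits))

base_logits = [1.0, 0.5, -0.5, 0.0]
for scale in (1.0, 10.0, 100.0):
    logits = [scale * z for z in base_logits]
    probs = softmax(logits)
    print(f"scale {scale:>5}: max prob = {max(probs):.6f}, "
          f"max p(1-p) = {max_softmax_grad(logits):.2e}")
```

Keeping the vectors that form `Q` and `K` at a controlled scale keeps the logits in the regime where these derivatives are still usable.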

---

#### Intuition ③ – per-sample rescaling works like an adaptive learning rate  
For a single linear layer \(y=W h\), the gradient w.r.t. the weight column \(W_{:,i}\) is proportional to \(h_i\).  
If the entries of \(h\) are routinely 10× larger for some tokens or layers, the corresponding weight gradients are 10× larger too ⇒ training becomes imbalanced.  
LayerNorm rescales *inside the network* so that every vector reaches the layer at a comparable scale, loosely similar to Adam’s per-parameter √variance term but embedded in the forward pass.
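
A small numeric check of this argument: for `y = W h`, the weight gradient is the outer product of `dL/dy` and `h`, so scaling `h` by 100 scales the gradient by 100, while standardizing `h` first removes the effect. The upstream gradient `delta` and the input vectors are hypothetical values, and `γ`/`β` are omitted.

```python
import math

def weight_gradient(h, delta):
    """dL/dW for y = W h, where delta = dL/dy: entry [j][i] is delta[j] * h[i]."""
    return [[dj * hi for hi in h] for dj in delta]

def frobenius_norm(grad):
    """Overall size of a gradient matrix."""
    return math.sqrt(sum(g * g for row in grad for g in row))

def standardize(h, eps=1e-5):
    """The normalization step of LayerNorm (no gamma/beta)."""
    mu = sum(h) / len(h)
    var = sum((v - mu) ** 2 for v in h) / len(h)
    return [(v - mu) / math.sqrt(var + eps) for v in h]

delta = [0.3, -0.4]                        # hypothetical upstream gradient dL/dy
h_small = [1.0, 2.0, 4.0]
h_large = [100.0 * v for v in h_small]     # same direction, 100x the scale

raw_ratio = (frobenius_norm(weight_gradient(h_large, delta))
             / frobenius_norm(weight_gradient(h_small, delta)))
normed_ratio = (frobenius_norm(weight_gradient(standardize(h_large), delta))
                / frobenius_norm(weight_gradient(standardize(h_small), delta)))
print(f"gradient ratio without LayerNorm: {raw_ratio:.3f}")
print(f"gradient ratio with LayerNorm:    {normed_ratio:.3f}")
```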

---

#### “But aren’t 512 features too few to estimate a variance?”
We’re **not** trying to estimate the *population* variance of some random variable; we’re enforcing a *constraint* on *this particular vector*.  
Scale fluctuations between tokens are exactly what we want to damp: if a token suddenly gets a huge value in one dimension because of an unlucky softmax, LayerNorm reins it in *immediately*, without waiting for other samples.

---

### Putting it back in NLP context
1. **Embedding lookup** already gives you heavy-tailed coordinates (some tokens use rare sub-spaces).  
2. Feed-forward + attention layers mix those coordinates non-linearly; their scale can drift over depth.  
3. LayerNorm “flushes” that drift at every block, making depth-wise residual addition (`x + F(x)`) numerically safe.  
4. Because it’s batch-agnostic, you can:
   * run sequence-by-sequence during inference,
   * fine-tune with very small batches,
   * re-batch or reorder sequences without changing any token’s normalized output.

That is why almost every transformer variant since 2017 uses LayerNorm or a close relative such as RMSNorm (often twice per block), while BatchNorm virtually disappeared from sequence models.


---

## Quick mental picture

```
BatchNorm:  ←mean/var over 32 samples→  (per dim)   =====>  stabilise feature *population*
LayerNorm:  ←mean/var over 512 dims→    (per sample) =====>  stabilise *each* sample’s vector
```

Different axes, same core goal: **smooth training by taming activation scale—just tailored to the setting (images vs. sequences).**
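
The difference in axes can be checked directly: put the same sample into two different batches and compare. In this minimal sketch (no `γ`/`β`, hypothetical toy values) the BatchNorm output for the sample changes with its batch mates, while the LayerNorm output does not.

```python
import math

def batch_norm(batch, eps=1e-5):
    """Normalize each feature (column) across the samples in the batch."""
    n, d = len(batch), len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in batch]
        mu = sum(col) / n
        var = sum((v - mu) ** 2 for v in col) / n
        for i in range(n):
            out[i][j] = (batch[i][j] - mu) / math.sqrt(var + eps)
    return out

def layer_norm(batch, eps=1e-5):
    """Normalize each sample (row) across its own features."""
    out = []
    for row in batch:
        mu = sum(row) / len(row)
        var = sum((v - mu) ** 2 for v in row) / len(row)
        out.append([(v - mu) / math.sqrt(var + eps) for v in row])
    return out

sample = [1.0, 2.0, 6.0]
batch_a = [sample, [0.0, 0.0, 0.0], [2.0, 4.0, 3.0]]
batch_b = [sample, [50.0, -7.0, 9.0], [-3.0, 8.0, 1.0]]

bn_a, bn_b = batch_norm(batch_a)[0], batch_norm(batch_b)[0]
ln_a, ln_b = layer_norm(batch_a)[0], layer_norm(batch_b)[0]
print("BatchNorm, same sample in two batches:", bn_a, bn_b)
print("LayerNorm, same sample in two batches:", ln_a, ln_b)
```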

---



### 🔁 **What Are Residual (Skip) Connections in Transformers?**

In Transformer architectures, **residual connections** are used to **bypass** a layer's transformation and directly **add** the layer's input to its output.

Formally, for a sub-layer (like self-attention or feed-forward layer), the residual connection is:

$$
\text{Output} = \text{LayerNorm}(x + \text{Sublayer}(x))
$$

Where:
- `x` is the input to the sub-layer,
- `Sublayer(x)` is the transformation (like attention or MLP),
- `x + Sublayer(x)` is the residual connection (a skip),
- `LayerNorm` is applied **after** the addition (the post-LN ordering of the original Transformer; pre-LN variants instead compute `x + Sublayer(LayerNorm(x))`).
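
The two orderings differ only in where the normalization sits relative to the skip. A minimal sketch, with `toy_sublayer` as a hypothetical stand-in for attention or an MLP and LayerNorm written without `γ`/`β`:

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without learnable gamma/beta, to keep the sketch short."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def add(a, b):
    """Element-wise sum of two vectors (the residual addition)."""
    return [ai + bi for ai, bi in zip(a, b)]

def post_ln_block(x, sublayer):
    """Post-LN ordering: LayerNorm(x + Sublayer(x))."""
    return layer_norm(add(x, sublayer(x)))

def pre_ln_block(x, sublayer):
    """Pre-LN ordering: x + Sublayer(LayerNorm(x))."""
    return add(x, sublayer(layer_norm(x)))

def toy_sublayer(x):
    """Hypothetical stand-in for attention or an MLP."""
    return [0.5 * v for v in x]

x = [1.0, 2.0, 3.0, 6.0]
print("post-LN:", post_ln_block(x, toy_sublayer))
print("pre-LN: ", pre_ln_block(x, toy_sublayer))
```

In the pre-LN form the input `x` reaches the output untouched by any normalization, so the skip path stays a pure identity.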

---

### ⚙️ **How Residual Connections Help Gradient Flow**

The main benefit comes during **backpropagation** — when gradients are propagated from output back to input. Let's see **why they're useful in deep networks**:

#### 🔻 Without Skip Connections:
- Suppose you have a very deep network, say 48 layers.
- Each layer applies a nonlinear transformation (e.g., ReLU, attention, etc.).
- If each transformation reduces the gradient just a little (say by multiplying with a small value), then **after many layers**, the gradients can shrink to almost zero → **vanishing gradients**.
- Result: early layers stop learning.

#### 🔄 With Skip Connections:
- The gradient has a **direct path** to flow backward through the skip.
- Since the output is \( y = x + F(x) \), the gradient of the loss \( L \) w.r.t. \( x \) is:

$$
\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \left(1 + \frac{\partial F(x)}{\partial x}\right)
$$

- The presence of that **`1`** (the identity matrix in the vector case) ensures that **some part of the gradient always flows unchanged**.
- This allows **earlier layers to still receive meaningful gradients**, even when \( \partial F / \partial x \) becomes small or vanishes.
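
A scalar toy model makes the contrast visible: multiply the per-layer local derivatives along the backward path, once for plain layers and once for residual layers. The value `dF/dx = 0.01` for all 48 layers is a made-up assumption, picked only to mimic weakly contributing layers.

```python
def gradient_scale(layer_derivs, residual):
    """Multiply per-layer local derivatives along the backward path.

    Scalar toy model: each layer is y = F(x) (plain) or y = x + F(x)
    (residual), and `layer_derivs` holds dF/dx for every layer.
    """
    scale = 1.0
    for d in layer_derivs:
        scale *= (1.0 + d) if residual else d
    return scale

derivs = [0.01] * 48                    # hypothetical: dF/dx = 0.01 in every layer
plain = gradient_scale(derivs, residual=False)
skip = gradient_scale(derivs, residual=True)
print(f"plain stack:    {plain:.3e}")
print(f"residual stack: {skip:.3e}")
```

The plain product collapses to a vanishingly small number, while the residual product stays of order 1 because every factor contains the identity term.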

---

### 🧠 Why It Works So Well

1. **Identity Path:** The skip connection behaves like an identity map — gradients flow even if the main layer `F(x)` is untrained or weakly contributing.
2. **Residual Learning:** Each layer adds a small delta on top of identity, so it learns *residuals* rather than full representations.
3. **Ease of Optimization:** The network is encouraged to learn modifications to identity rather than new functions from scratch.

---

### ✅ In Transformers Specifically:
- Each attention and feed-forward layer has a residual path.
- This is critical because transformers are **deep stacks of blocks**.
- Without residuals, models like GPT-3, BERT, or ViT wouldn't train effectively due to **gradient degradation**.

---
