# Gradient Clipping & Gradient Checking

---

## 🔹 Gradient Clipping
**Definition:**  
A technique to prevent the **exploding gradient problem** by restricting the magnitude of gradients before applying parameter updates.  

**Why use it?**  
- Stabilizes training (especially in deep networks & RNNs).  
- Prevents weights from taking very large steps.  

**Types:**
1. **Norm Clipping (preferred):** Scale all gradients down proportionally if their overall norm exceeds a threshold.  

$$
g \; \leftarrow \; \frac{g}{\max\left(1, \frac{\|g\|}{\tau}\right)}
$$  

where:  
- $g$ = gradient vector  
- $\tau$ = threshold  

 

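2. **Value Clipping:** Clamp each gradient component independently to a fixed range $[-c, c]$. Unlike norm clipping, this can change the direction of the gradient:  

$$
g_i \; \leftarrow \; \max\left(-c, \; \min\left(g_i, c\right)\right)
$$  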
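
As a small worked example (the numbers are illustrative, not taken from any experiment), take $g = (3, 4)$, so $\|g\|_2 = 5$. With $\tau = 2.5$, norm clipping gives:

$$
g \; \leftarrow \; \frac{(3, 4)}{\max\left(1, \frac{5}{2.5}\right)} = (1.5, 2), \qquad \|g\|_2 = 2.5
$$

With $c = 2$, value clipping gives:

$$
g \; \leftarrow \; \big(\min(3, 2), \; \min(4, 2)\big) = (2, 2)
$$

The norm-clipped vector still points along $(3, 4)$, while the value-clipped one points along $(1, 1)$.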

---

## 🔹 Gradient Checking
**Definition:**  
A debugging technique to verify correctness of **backpropagation** by comparing analytical gradients with numerical approximations.  

**Why use it?**  
- Detects errors in manually implemented derivatives.  
- Confirms that the gradient from backprop agrees with a finite-difference estimate of the same derivative.  

**Formula (Central Difference Approximation):**  

$$
\frac{\partial J}{\partial \theta_i} \;\approx\; 
\frac{J(\theta_i + \epsilon) - J(\theta_i - \epsilon)}{2\epsilon}
$$  

where:  
- $J$ = loss function  
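- $\theta_i$ = the $i$-th parameter (all other parameters are held fixed)  
- $\epsilon$ = small step (e.g., $10^{-5}$); too large a value adds truncation error, too small a value amplifies floating-point round-off  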

**Process:**  
1. Choose a parameter $\theta_i$.  
2. Perturb by $+\epsilon$ and $-\epsilon$.  
3. Compute the loss in both cases and form the central difference.  
4. Compare the result with the analytical gradient from backprop, then repeat for the other parameters.  

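✅ If the relative error between the two gradients is small (e.g., $< 10^{-7}$ in double precision), your gradients are very likely correct. A small error is strong evidence, not a proof: the check only covers the inputs you tested.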
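
To see why the central difference is used here rather than a one-sided difference, the sketch below compares both on a function with a known derivative. The test function `math.sin`, the point `x = 1.0`, and the helper names `forward_diff` and `central_diff` are illustrative choices, not part of the original notebook.

```python
import math

def forward_diff(f, x, eps):
    """One-sided estimate: truncation error shrinks like eps."""
    return (f(x + eps) - f(x)) / eps

def central_diff(f, x, eps):
    """Two-sided estimate: truncation error shrinks like eps**2."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = 1.0
exact = math.cos(x)  # d/dx sin(x) = cos(x)

errors = []
for eps in (1e-2, 1e-3, 1e-4, 1e-5):
    err_fwd = abs(forward_diff(math.sin, x, eps) - exact)
    err_ctr = abs(central_diff(math.sin, x, eps) - exact)
    errors.append((eps, err_fwd, err_ctr))
    print(f"eps={eps:.0e}  forward error={err_fwd:.2e}  central error={err_ctr:.2e}")
```

For each step size in this range, the central estimate should be several orders of magnitude closer to the exact derivative, at the cost of one extra loss evaluation per parameter.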


---

## 🔹 Summary
- **Gradient Clipping** → Used *during training* to stabilize updates.  
- **Gradient Checking** → Used *during debugging* to validate backprop.  



In [1]:
#imports 
import torch

import torch.nn as nn
import torch.optim as optim

from torch.nn.utils import clip_grad_norm_, clip_grad_value_

## Gradient Clipping (NumPy demo)

We’ll implement two clipping methods:

- **Norm clipping**: scale the whole gradient so its L2 norm ≤ τ  

$$
g \;\leftarrow\; \frac{g}{\max\!\Big(1,\; \frac{\|g\|_2}{\tau}\Big)}
$$  

- **Value clipping**: clamp each component independently to a fixed range $[-c, c]$  

$$
g_i \;\leftarrow\; \max\!\big(-c,\; \min(g_i,\, c)\big)
$$  

We’ll print the gradient **before/after** to see the effect.


In [2]:
# NumPy demo: gradient clipping (norm & value)
import numpy as np

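def clip_by_norm(g, tau: float):
    """
    Scale gradient vector g so that ||g||_2 <= tau (if needed); tau must be > 0.

    Returns (clipped_g, norm_before, norm_after).
    """
    norm = np.linalg.norm(g, ord=2)
    scale = max(1.0, norm / tau)   # scale == 1.0 leaves g unchanged
    g_clipped = g / scale          # compute once and reuse
    return g_clipped, norm, np.linalg.norm(g_clipped, ord=2)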

def clip_by_value(g, c: float):
    """
    Clamp each component of g to [-c, c].
    """
    g_clipped = np.clip(g, -c, c)
    return g_clipped

# Example

In [3]:
# Example usage of our clipping functions
rng = np.random.default_rng(0)
g = rng.normal(0.0, 5.0, size=8)   # pretend gradient vector

print("Original g:", np.round(g, 3))
print("Original ||g||:", round(np.linalg.norm(g), 3))

# ---- Norm clipping ----
tau = 5.0
g_norm, norm_before, norm_after = clip_by_norm(g, tau)
print("\n[Norm clipping]")
print("τ =", tau)
print("||g|| before:", round(norm_before, 3))
print("||g|| after :", round(norm_after, 3))
print("Clipped g   :", np.round(g_norm, 3))

# ---- Value clipping ----
c = 2.0
g_val = clip_by_value(g, c)
print("\n[Value clipping]")
print("c =", c)
print("Clipped g   :", np.round(g_val, 3))
print("||g|| after :", round(np.linalg.norm(g_val), 3))


Original g: [ 0.629 -0.661  3.202  0.525 -2.678  1.808  6.52   4.735]
Original ||g||: 9.313

[Norm clipping]
τ = 5.0
||g|| before: 9.313
||g|| after : 5.0
Clipped g   : [ 0.338 -0.355  1.719  0.282 -1.438  0.971  3.5    2.542]

[Value clipping]
c = 2.0
Clipped g   : [ 0.629 -0.661  2.     0.525 -2.     1.808  2.     2.   ]
||g|| after : 4.514
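
The two methods differ in more than the resulting norm. As a rough sketch, the snippet below measures the cosine similarity between the original and clipped gradients, using the rounded values of `g` printed above with the same `tau` and `c`. The helper `cosine` is a name introduced here for illustration.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Rounded gradient values copied from the printed output above
g = np.array([0.629, -0.661, 3.202, 0.525, -2.678, 1.808, 6.52, 4.735])
tau, c = 5.0, 2.0

g_norm = g / max(1.0, np.linalg.norm(g) / tau)   # norm clipping
g_val = np.clip(g, -c, c)                        # value clipping

cos_norm = cosine(g, g_norm)
cos_val = cosine(g, g_val)
print("cosine(g, norm-clipped) :", round(cos_norm, 6))
print("cosine(g, value-clipped):", round(cos_val, 6))
```

Norm clipping rescales every component by the same factor, so the cosine stays at 1 up to floating-point error. Value clipping flattens only the largest components, so the update direction rotates away from the original gradient.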


## Gradient Checking (NumPy, finite differences)

We check our analytic gradient against a numerical approximation using:

$$
\frac{\partial J}{\partial \theta_i} \;\approx\; 
\frac{J(\theta_i + \epsilon) - J(\theta_i - \epsilon)}{2\epsilon}
$$  

Test function:  

$$
J(\theta) = \tfrac{1}{2}\,\theta^\top A \theta + b^\top \theta
$$  

Analytic gradient:  

$$
\nabla_\theta J(\theta) = \tfrac{1}{2}(A + A^\top)\theta + b
$$  

Relative error metric:  

$$
\text{rel\_err} \;=\; \frac{\|\nabla J_{\text{analytic}} - \nabla J_{\text{numeric}}\|_2}
{\|\nabla J_{\text{analytic}}\|_2 + \|\nabla J_{\text{numeric}}\|_2 + 10^{-12}}
$$  

A value below ~$10^{-7}$ means the gradients match very well.


In [4]:
# NumPy demo: gradient checking (finite differences)
import numpy as np

def loss(theta, A, b):
    """Quadratic loss: J(θ) = 0.5 θᵀ A θ + bᵀ θ"""
    return 0.5 * theta @ (A @ theta) + b @ theta

def grad_analytic(theta, A, b):
    """Analytic gradient: ∇J(θ) = 0.5(A + Aᵀ)θ + b"""
    return 0.5 * (A + A.T) @ theta + b

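def grad_numeric(theta, A, b, eps=1e-5):
    """Numeric gradient via central difference, one coordinate at a time."""
    theta = np.asarray(theta, dtype=float)  # avoid integer truncation of the result
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = 1.0                          # i-th unit vector
        g[i] = (loss(theta + eps * e, A, b) - loss(theta - eps * e, A, b)) / (2 * eps)
    return g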

def relative_error(g1, g2):
    """Compare two gradients with relative error."""
    num = np.linalg.norm(g1 - g2)
    den = np.linalg.norm(g1) + np.linalg.norm(g2) + 1e-12
    return num / den


# Example

In [5]:
# Example usage of gradient checking
np.random.seed(0)
d = 5

# random quadratic problem
A = np.random.randn(d, d)
A = A.T @ A + 0.1*np.eye(d)
b = np.random.randn(d)
theta = np.random.randn(d)

# compute gradients
g_ana = grad_analytic(theta, A, b)
g_num = grad_numeric(theta, A, b, eps=1e-5)
err = relative_error(g_ana, g_num)

# print results
print("θ:", np.round(theta, 3))
print("\nAnalytic grad:", np.round(g_ana, 6))
print("Numeric grad :", np.round(g_num, 6))
print("\nRelative error:", err)
if err < 1e-7:
    print("Gradients match!")
else:
    print("Gradients differ!")


θ: [ 0.155  0.378 -0.888 -1.981 -0.348]

Analytic grad: [-10.905588  -1.915013  -6.432957 -10.842264  -9.894571]
Numeric grad : [-10.905588  -1.915013  -6.432957 -10.842264  -9.894571]

Relative error: 3.153475444993319e-12
Gradients match!
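
A passing check is only as informative as the inputs it runs on. The sketch below, under the assumption of a hypothetical bug (`grad_buggy` drops the symmetrization and returns $A\theta + b$), shows that the check flags the bug when $A$ is not symmetric but cannot see it when $A$ is symmetric, as it is in the example above. All helpers are redefined so the cell runs on its own; `grad_correct` and `grad_buggy` are names introduced here.

```python
import numpy as np

def loss(theta, A, b):
    return 0.5 * theta @ (A @ theta) + b @ theta

def grad_correct(theta, A, b):
    return 0.5 * (A + A.T) @ theta + b

def grad_buggy(theta, A, b):
    # Hypothetical bug: silently assumes A is symmetric
    return A @ theta + b

def grad_numeric(theta, A, b, eps=1e-5):
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = 1.0
        g[i] = (loss(theta + eps * e, A, b) - loss(theta - eps * e, A, b)) / (2 * eps)
    return g

def relative_error(g1, g2):
    return np.linalg.norm(g1 - g2) / (np.linalg.norm(g1) + np.linalg.norm(g2) + 1e-12)

rng = np.random.default_rng(0)
d = 5
A_general = rng.normal(size=(d, d))      # deliberately NOT symmetric
A_sym = A_general.T @ A_general          # symmetric by construction
b = rng.normal(size=d)
theta = rng.normal(size=d)

err_correct = relative_error(grad_correct(theta, A_general, b), grad_numeric(theta, A_general, b))
err_buggy = relative_error(grad_buggy(theta, A_general, b), grad_numeric(theta, A_general, b))
err_hidden = relative_error(grad_buggy(theta, A_sym, b), grad_numeric(theta, A_sym, b))

print("correct gradient, general A  :", err_correct)
print("buggy gradient,   general A  :", err_buggy)
print("buggy gradient,   symmetric A:", err_hidden)
```

The practical takeaway is to run gradient checks on inputs without special structure (non-symmetric matrices, non-zero biases, random parameters), so that simplifying assumptions hidden in the analytic code are exercised.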


## Gradient Clipping in PyTorch (training loop)

PyTorch optimizers **don’t clip gradients automatically**.  
You call a utility **before** `optimizer.step()`:

```python
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=τ)
# or
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=c)
```


In [6]:
from torch.nn.utils import clip_grad_norm_, clip_grad_value_

# Example training loop
torch.manual_seed(0)

# fake regression dataset
X = torch.randn(64, 5)
true_w = torch.randn(5, 1)
y = X @ true_w + 0.1 * torch.randn(64, 1)

# simple linear model
model = nn.Linear(5, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.2)
loss_fn = nn.MSELoss()

# choose thresholds
tau = 1.0  # for norm clipping
c = 0.5    # for value clipping

for step in range(5):
    opt.zero_grad()
    pred = model(X)
    # named batch_loss so it does not shadow the loss() function from the NumPy demo
    batch_loss = loss_fn(pred, y)
    batch_loss.backward()

    # --- apply clipping (after backward, before step) ---
    grad_norm_before = torch.sqrt(sum((p.grad.norm()**2 for p in model.parameters())))
    clip_grad_norm_(model.parameters(), max_norm=tau)   # norm clipping
    # alternative: clip_grad_value_(model.parameters(), clip_value=c)   # value clipping

    grad_norm_after = torch.sqrt(sum((p.grad.norm()**2 for p in model.parameters())))

    opt.step()

    print(f"Step {step} | Loss {batch_loss.item():.4f} | "
          f"Grad norm before {grad_norm_before:.3f} → after {grad_norm_after:.3f}")


Step 0 | Loss 5.6410 | Grad norm before 4.358 → after 1.000
Step 1 | Loss 4.8046 | Grad norm before 4.006 → after 1.000
Step 2 | Loss 4.0385 | Grad norm before 3.656 → after 1.000
Step 3 | Loss 3.3419 | Grad norm before 3.310 → after 1.000
Step 4 | Loss 2.7144 | Grad norm before 2.966 → after 1.000
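
The loop above only exercises norm clipping. As a minimal sketch, the cell below applies `clip_grad_value_` as well, and also checks that the value returned by `clip_grad_norm_` is the total gradient norm measured before clipping, which would make the manual "before" computation unnecessary. The random data, the `c = 0.5` threshold, and the variable names are assumptions made for this illustration; the return-value behaviour is as documented for recent PyTorch releases.

```python
import torch
import torch.nn as nn
from torch.nn.utils import clip_grad_norm_, clip_grad_value_

torch.manual_seed(0)
X = torch.randn(64, 5)
y = torch.randn(64, 1)          # illustrative targets
model = nn.Linear(5, 1)
loss_fn = nn.MSELoss()

# --- value clipping: every gradient entry ends up in [-c, c] ---
c = 0.5
model.zero_grad()
loss_fn(model(X), y).backward()
clip_grad_value_(model.parameters(), clip_value=c)
max_abs = max(p.grad.abs().max().item() for p in model.parameters())
print("largest |grad entry| after value clipping:", max_abs)

# --- norm clipping: the return value is the norm BEFORE clipping ---
model.zero_grad()
loss_fn(model(X), y).backward()
manual_norm = torch.sqrt(sum(p.grad.norm() ** 2 for p in model.parameters()))
returned_norm = clip_grad_norm_(model.parameters(), max_norm=1.0)
print("manual norm before clipping:", float(manual_norm))
print("returned by clip_grad_norm_:", float(returned_norm))
```

Logging the returned norm each step is a cheap way to see how often the threshold is actually hit, which helps when tuning `max_norm`.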
