A **finite-difference gradient check** (sometimes called a _numerical gradient check_) is a quick way to verify that your **backward()** math is correct by comparing it to an approximate gradient computed from the forward function only.

### Idea

If you nudge one input value $x_i$ up and down by a tiny $\epsilon$ while holding every other input fixed, you can estimate the partial derivative with a central difference:

$$\frac{\partial L}{\partial x_i} \approx \frac{L\left( x_i + \epsilon \right) - L\left(x_i - \epsilon\right)} {2 \epsilon}$$

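Then you compare this "numerical" gradient with the result of your backward pass. If the backward math is correct, the two should agree up to a small approximation and rounding error.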
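The estimate above can be sketched on a toy function before applying it to LayerNorm. This is only an illustration: `numerical_gradient` and `toy_loss` are hypothetical names, and the loss $\sum x^2$ is chosen because its exact gradient $2x$ is known.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference estimate of df/dx for a scalar-valued f (hypothetical helper)."""
    x = np.array(x, dtype=np.float64)  # float64 copy, so the caller's array is untouched
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        old = x[idx]
        x[idx] = old + eps             # nudge one entry up
        f_plus = f(x)
        x[idx] = old - eps             # nudge the same entry down
        f_minus = f(x)
        x[idx] = old                   # restore before moving on
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

def toy_loss(x):
    # Toy loss with a known gradient: d/dx sum(x^2) = 2x
    return np.sum(x ** 2)

x0 = np.array([[1.0, -2.0], [0.5, 3.0]])
num = numerical_gradient(toy_loss, x0)
exact = 2 * x0
max_err = np.max(np.abs(num - exact))
print("max error vs. exact gradient:", max_err)
```

Each entry costs two forward evaluations, so this approach is only practical for small test tensors.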

### Minimal example for LayerNorm

We'll create a fake loss:

$$L = \sum \left( \operatorname{layer\_norm}(X) \odot G \right)$$

where $G$ is a random upstream gradient (same shape as output).
This makes the analytical gradient equal to
```Python
layer_norm_backward(G, cache)
```
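The reasoning behind that claim can be spelled out with the chain rule. Writing $Y = \operatorname{layer\_norm}(X)$ and using a single flat index $j$ over output elements (a notational assumption for brevity), the fake loss is linear in $Y$:

$$L = \sum_j Y_j G_j \quad\Rightarrow\quad \frac{\partial L}{\partial Y_j} = G_j$$

$$\frac{\partial L}{\partial x_i} = \sum_j \frac{\partial L}{\partial Y_j} \frac{\partial Y_j}{\partial x_i} = \sum_j G_j \frac{\partial Y_j}{\partial x_i}$$

So $G$ plays exactly the role of the upstream gradient $\partial L / \partial Y$, and a backward function that is handed $G$ should return $\partial L / \partial X$.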


In [1]:
import numpy as np

In [4]:
from llm_operator import layer_norm, layer_norm_backward, Layer_norm_cache

In [None]:
rng = np.random.default_rng(1000)

In [3]:
def grad_check_layer_norm(layer_norm, layer_norm_backward, eps=1e-6):
    rng = np.random.default_rng(1000)

    B, T, C = 2, 4, 6

    X = -rng.random((B, T, C), dtype=np.float64)
    G = -rng.random((B, T, C), dtype=np.float64) # upstream gradient

    # forward
    Y, cache = layer_norm(X, epsilon = 1e-5)

    # analytical gradient
    dX = layer_norm_backward(G, cache)

    # numerical gradient: perturb one element of X at a time
    dX_num = np.zeros_like(X)
    it = np.nditer(X, flags=['multi_index'], op_flags=['readwrite'])

    while not it.finished:
        idx = it.multi_index
        old = X[idx]

        # loss with this element nudged up by eps
        X[idx] = old + eps
        Yp, _ = layer_norm(X, epsilon=1e-5)
        Lp = np.sum(Yp * G)

        # loss with this element nudged down by eps
        X[idx] = old - eps
        Ym, _ = layer_norm(X, epsilon=1e-5)
        Lm = np.sum(Ym * G)

        # restore the original value, then take the central difference
        X[idx] = old
        dX_num[idx] = (Lp - Lm) / (2 * eps)

        it.iternext()

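    # compare: largest absolute gap, then scale it by the combined gradient magnitude
    max_abs = np.max(np.abs(dX - dX_num))
    # the 1e-22 term guards against division by zero when both gradients are all zeros
    rel = max_abs / (np.max(np.abs(dX) + np.abs(dX_num)) + 1e-22)

    print("max_abs_diff:", max_abs)
    print("relative_diff:", rel)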

In [5]:
grad_check_layer_norm(layer_norm, layer_norm_backward)

max_abs_diff: 1.1812836819835582e-09
relative_diff: 2.61470216195465e-10
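To turn printed numbers like these into a pass/fail check, one option is to compare the relative difference against a tolerance. The sketch below is an assumption-laden illustration: `max_relative_diff` and `gradients_match` are hypothetical helpers, the tolerance `1e-6` is an arbitrary choice rather than a recommended value, and the arrays are stand-ins rather than real LayerNorm gradients.

```python
import numpy as np

def max_relative_diff(analytic, numeric, tiny=1e-22):
    """Same metric as the notebook's `rel`: max abs gap over combined magnitude."""
    max_abs = np.max(np.abs(analytic - numeric))
    scale = np.max(np.abs(analytic) + np.abs(numeric)) + tiny
    return max_abs / scale

def gradients_match(analytic, numeric, tol=1e-6):
    # tol is an assumed threshold; tune it for your dtype and step size
    return bool(max_relative_diff(analytic, numeric) < tol)

# Stand-in arrays (not real LayerNorm gradients)
a = np.array([1.0, -2.0, 3.0])
ok = gradients_match(a, a + 1e-9)   # tiny perturbation: should pass
bad = gradients_match(a, a * 1.01)  # 1% scaling error: should fail
print("close gradients pass:", ok)
print("scaled gradients pass:", bad)
```

Returning a boolean instead of printing makes the check easy to drop into a unit test.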
