## Numerical Differentiation

Numerical differentiation approximates the gradient by perturbing the inputs and observing changes in the output. Here, we calculate the gradient numerically for the weights.

## Numerical Gradient Calculation

We use the central difference method for numerical differentiation. Given a function $ f(w) $, the $ i $-th component of the gradient of $ f $ at $ w $ is approximated by:

$$
\frac{\partial f}{\partial w_i} \approx \frac{f(w + \epsilon e_i) - f(w - \epsilon e_i)}{2\epsilon}
$$

where $ \epsilon $ is a small number (e.g., $ 10^{-5} $) and $ e_i $ is the unit vector in the direction of the $ i $-th parameter.
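The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `central_difference_gradient`, the list-of-floats representation of $ w $, and the toy function used to exercise it are all assumptions made for the example.

```python
def central_difference_gradient(f, w, eps=1e-5):
    """Approximate the gradient of a scalar function f at w (a list of floats)."""
    grad = []
    for i in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[i] += eps    # w + eps * e_i
        w_minus[i] -= eps   # w - eps * e_i
        grad.append((f(w_plus) - f(w_minus)) / (2 * eps))
    return grad


# Toy check (hypothetical function): f(w) = w_0^2 + 3 * w_1 has gradient [2 * w_0, 3].
def toy(w):
    return w[0] ** 2 + 3 * w[1]


toy_grad = central_difference_gradient(toy, [2.0, -1.0])
print(toy_grad)
```

Each component costs two evaluations of `f`, so the full gradient of a function with $ n $ parameters needs $ 2n $ evaluations.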

Let's compute the numerical gradient for $ w_{11} $:

1. Choose $ \epsilon = 10^{-5} $.
2. Compute $ L(w_{11} + \epsilon) $ and $ L(w_{11} - \epsilon) $.
3. Let $ x_1 = 1 $, $ x_2 = 2 $, $ t = 5 $, and let the initial weights be:
   - $ w_{11} = 0.1 $
   - $ w_{12} = -0.2 $
   - $ w_{21} = 0.3 $
   - $ w_{22} = 0.4 $
   - $ v_1 = 0.5 $
   - $ v_2 = -0.5 $
4. **Forward Pass for $ w_{11} + \epsilon $:**
   - $ w_{11} + \epsilon = 0.10001 $
   - $ h_1 = (0.10001) \cdot 1 + (-0.2) \cdot 2 = 0.10001 - 0.4 = -0.29999 $
   - $ h_2 = 0.3 \cdot 1 + 0.4 \cdot 2 = 0.3 + 0.8 = 1.1 $
   - $ y = 0.5 \cdot (-0.29999) + (-0.5) \cdot 1.1 = -0.149995 - 0.55 = -0.699995 $
   - $ L = \frac{1}{2} \cdot (-0.699995 - 5)^2 = \frac{1}{2} \cdot (32.489943000025) = 16.2449715000125 $
5. **Forward Pass for $ w_{11} - \epsilon $:**
   - $ w_{11} - \epsilon = 0.09999 $
   - $ h_1 = (0.09999) \cdot 1 + (-0.2) \cdot 2 = 0.09999 - 0.4 = -0.30001 $
   - $ h_2 = 0.3 \cdot 1 + 0.4 \cdot 2 = 0.3 + 0.8 = 1.1 $
   - $ y = 0.5 \cdot (-0.30001) + (-0.5) \cdot 1.1 = -0.150005 - 0.55 = -0.700005 $
   - $ L = \frac{1}{2} \cdot (-0.700005 - 5)^2 = \frac{1}{2} \cdot (32.490057000025) = 16.2450285000125 $
6. **Numerical Gradient for $ w_{11} $:**
   - $ \frac{\partial L}{\partial w_{11}} \approx \frac{L(w_{11} + \epsilon) - L(w_{11} - \epsilon)}{2\epsilon} $
   - $ \frac{\partial L}{\partial w_{11}} \approx \frac{16.2449715000125 - 16.2450285000125}{2 \times 10^{-5}} = \frac{-5.7 \times 10^{-5}}{2 \times 10^{-5}} $
   - $ \frac{\partial L}{\partial w_{11}} \approx -2.85 $

Therefore, the numerical gradient for $ w_{11} $ is approximately $ -2.85 $, which agrees with the analytic value $ (y - t) \cdot v_1 \cdot x_1 = -5.7 \cdot 0.5 \cdot 1 = -2.85 $.
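The hand calculation above can be checked with a short Python sketch. It assumes the network used in the forward passes, $ y = v_1 h_1 + v_2 h_2 $ with squared-error loss $ L = \frac{1}{2}(y - t)^2 $; the function name `loss` and the dictionary of weights are illustrative choices, not part of the original setup.

```python
def loss(w11, w12, w21, w22, v1, v2, x1=1.0, x2=2.0, t=5.0):
    """Forward pass of the small linear network followed by the squared-error loss."""
    h1 = w11 * x1 + w12 * x2
    h2 = w21 * x1 + w22 * x2
    y = v1 * h1 + v2 * h2
    return 0.5 * (y - t) ** 2


weights = dict(w11=0.1, w12=-0.2, w21=0.3, w22=0.4, v1=0.5, v2=-0.5)
eps = 1e-5

# Perturb only w11, keeping every other weight fixed.
plus = dict(weights, w11=weights["w11"] + eps)
minus = dict(weights, w11=weights["w11"] - eps)

grad_w11 = (loss(**plus) - loss(**minus)) / (2 * eps)
print(grad_w11)
```

Because $ L $ is quadratic in $ w_{11} $, the central difference has no truncation error here; any deviation in the printed value comes from floating-point rounding alone.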


## Jacobian Matrix

The Jacobian matrix of a vector-valued function $ f $ with respect to a vector $ w $ is the matrix of all first-order partial derivatives of the function. For our neural network, taking $ f $ to be the single network output $ y $ and $ w $ to be the six weights, the Jacobian $ J $ is a $ 1 \times 6 $ row vector:

$$
J = \begin{bmatrix}
\frac{\partial y}{\partial w_{11}} & \frac{\partial y}{\partial w_{12}} & \frac{\partial y}{\partial w_{21}} & \frac{\partial y}{\partial w_{22}} & \frac{\partial y}{\partial v_1} & \frac{\partial y}{\partial v_2}
\end{bmatrix}
$$

From the forward pass, we have:

$$
\begin{align*}
y &= v_1 h_1 + v_2 h_2 \\
h_1 &= w_{11} x_1 + w_{12} x_2 \\
h_2 &= w_{21} x_1 + w_{22} x_2 \\
\end{align*}
$$

Thus:

$$
\begin{align*}
\frac{\partial y}{\partial w_{11}} &= v_1 x_1 \\
\frac{\partial y}{\partial w_{12}} &= v_1 x_2 \\
\frac{\partial y}{\partial w_{21}} &= v_2 x_1 \\
\frac{\partial y}{\partial w_{22}} &= v_2 x_2 \\
\frac{\partial y}{\partial v_1} &= h_1 \\
\frac{\partial y}{\partial v_2} &= h_2 \\
\end{align*}
$$

With the given values, for which $ h_1 = -0.3 $ and $ h_2 = 1.1 $:

$$
J = \begin{bmatrix}
0.5 \cdot 1 & 0.5 \cdot 2 & -0.5 \cdot 1 & -0.5 \cdot 2 & -0.3 & 1.1
\end{bmatrix}
= \begin{bmatrix}
0.5 & 1 & -0.5 & -1 & -0.3 & 1.1
\end{bmatrix}
$$
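As a quick check, the partial derivatives above can be evaluated directly in Python. This is a sketch under the same assumptions as the derivation (a linear network with one output); the function name `jacobian_y` and the ordering of its return value are illustrative.

```python
def jacobian_y(w11, w12, w21, w22, v1, v2, x1, x2):
    """Return [dy/dw11, dy/dw12, dy/dw21, dy/dw22, dy/dv1, dy/dv2]."""
    h1 = w11 * x1 + w12 * x2
    h2 = w21 * x1 + w22 * x2
    return [v1 * x1, v1 * x2, v2 * x1, v2 * x2, h1, h2]


J = jacobian_y(0.1, -0.2, 0.3, 0.4, 0.5, -0.5, x1=1.0, x2=2.0)
print(J)
```

The entries for $ v_1 $ and $ v_2 $ may print with small floating-point rounding in the last digits.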


## Hessian Matrix

The Hessian matrix $H$ is a square matrix of second-order partial derivatives of a scalar-valued function. For our loss function $L$, the Hessian $H$ is:

$$
H = \begin{bmatrix}
\frac{\partial^2 L}{\partial w_{11}^2} & \frac{\partial^2 L}{\partial w_{11} \partial w_{12}} & \cdots & \frac{\partial^2 L}{\partial w_{11} \partial v_{2}} \\
\frac{\partial^2 L}{\partial w_{12} \partial w_{11}} & \frac{\partial^2 L}{\partial w_{12}^2} & \cdots & \frac{\partial^2 L}{\partial w_{12} \partial v_{2}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 L}{\partial v_{2} \partial w_{11}} & \frac{\partial^2 L}{\partial v_{2} \partial w_{12}} & \cdots & \frac{\partial^2 L}{\partial v_{2}^2}
\end{bmatrix}
$$

Each element of $H$ can be computed using second-order derivatives of $L$. For example, to compute $\frac{\partial^2 L}{\partial w_{11}^2}$:

$$
\frac{\partial^2 L}{\partial w_{11}^2} = \frac{\partial}{\partial w_{11}}\left(\frac{\partial L}{\partial w_{11}}\right)
$$

Given:

$$
\frac{\partial L}{\partial w_{11}} = (y - t) \cdot v_1 \cdot x_1
$$

$$
\frac{\partial^2 L}{\partial w_{11}^2} = \frac{\partial}{\partial w_{11}}\left((y - t) \cdot v_1 \cdot x_1\right) = v_1 \cdot x_1 \cdot \frac{\partial (y - t)}{\partial w_{11}} = v_1 \cdot x_1 \cdot v_1 \cdot x_1 = v_1^2 \cdot x_1^2
$$

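With $v_1 = 0.5$ and $x_1 = 1$, this evaluates to $0.25$.

The other elements follow the same pattern. For two first-layer weights, the entry is the product of the corresponding first derivatives of $y$, for example $\frac{\partial^2 L}{\partial w_{11} \partial w_{12}} = v_1^2 \cdot x_1 \cdot x_2$. Entries that pair a first-layer weight with the output weight it feeds pick up an extra term, because $y$ contains the product of the two: $\frac{\partial^2 L}{\partial w_{11} \partial v_1} = h_1 \cdot v_1 \cdot x_1 + (y - t) \cdot x_1$.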
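One way to sanity-check Hessian entries is a second-order central difference. The sketch below is an assumption-laden illustration rather than the method used in the text: `numerical_hessian`, the parameter ordering `[w11, w12, w21, w22, v1, v2]`, and the step size `eps=1e-4` are all choices made for this example.

```python
def loss(p, x1=1.0, x2=2.0, t=5.0):
    """Squared-error loss of the small linear network; p = [w11, w12, w21, w22, v1, v2]."""
    w11, w12, w21, w22, v1, v2 = p
    h1 = w11 * x1 + w12 * x2
    h2 = w21 * x1 + w22 * x2
    y = v1 * h1 + v2 * h2
    return 0.5 * (y - t) ** 2


def numerical_hessian(f, p, eps=1e-4):
    """Approximate every second-order partial derivative of f at p."""
    n = len(p)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(si, sj):
                # Shift coordinates i and j by +/- eps (both shifts stack when i == j).
                q = list(p)
                q[i] += si * eps
                q[j] += sj * eps
                return f(q)

            H[i][j] = (shifted(1, 1) - shifted(1, -1)
                       - shifted(-1, 1) + shifted(-1, -1)) / (4 * eps ** 2)
    return H


params = [0.1, -0.2, 0.3, 0.4, 0.5, -0.5]
H = numerical_hessian(loss, params)
print(H[0][0])  # compare with v1^2 * x1^2
```

The top-left entry should land close to the analytic value $v_1^2 \cdot x_1^2$, up to floating-point rounding amplified by the division by $4\epsilon^2$.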


## Forward Mode Automatic Differentiation

Forward mode automatic differentiation propagates derivatives from inputs to outputs. For each input $x_i$, we track both the value and the derivative.

Let:

$$x_1 = 1, \quad x_2 = 2$$

Seed the input derivatives so that we differentiate with respect to $x_1$:

$$\dot{x}_1 = 1, \quad \dot{x}_2 = 0$$

Compute $h_1$ and its derivative:

$$
h_1 = w_{11} \cdot x_1 + w_{12} \cdot x_2 = 0.1 \cdot 1 + (-0.2) \cdot 2 = -0.3
$$

$$
\dot{h}_1 = w_{11} \cdot \dot{x}_1 + w_{12} \cdot \dot{x}_2 = 0.1 \cdot 1 + (-0.2) \cdot 0 = 0.1
$$

Compute $h_2$ and its derivative:

$$
h_2 = w_{21} \cdot x_1 + w_{22} \cdot x_2 = 0.3 \cdot 1 + 0.4 \cdot 2 = 1.1
$$

$$
\dot{h}_2 = w_{21} \cdot \dot{x}_1 + w_{22} \cdot \dot{x}_2 = 0.3 \cdot 1 + 0.4 \cdot 0 = 0.3
$$

Compute $y$ and its derivative:

$$
y = v_1 \cdot h_1 + v_2 \cdot h_2 = 0.5 \cdot (-0.3) + (-0.5) \cdot 1.1 = -0.7
$$

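$$
\dot{y} = v_1 \cdot \dot{h}_1 + v_2 \cdot \dot{h}_2 = 0.5 \cdot 0.1 + (-0.5) \cdot 0.3 = 0.05 - 0.15 = -0.1
$$

Because the seed was $\dot{x}_1 = 1$, $\dot{x}_2 = 0$, this result is $\frac{\partial y}{\partial x_1} = -0.1$. Obtaining $\frac{\partial y}{\partial x_2}$ requires a second pass with the seed $\dot{x}_1 = 0$, $\dot{x}_2 = 1$.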
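The value-and-derivative bookkeeping above is often implemented with dual numbers. The following is a minimal sketch, assuming only addition and multiplication are needed; the class name `Dual` and its attribute names are hypothetical and chosen for this example.

```python
class Dual:
    """A value paired with its derivative (tangent) along one seeded input direction."""

    def __init__(self, value, tangent=0.0):
        self.value = value
        self.tangent = tangent

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.tangent + other.tangent)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (a * b)' = a' * b + a * b'
        return Dual(self.value * other.value,
                    self.tangent * other.value + self.value * other.tangent)

    __rmul__ = __mul__


# Seed x1 with tangent 1 and x2 with tangent 0 to differentiate with respect to x1.
x1 = Dual(1.0, 1.0)
x2 = Dual(2.0, 0.0)

h1 = 0.1 * x1 + (-0.2) * x2
h2 = 0.3 * x1 + 0.4 * x2
y = 0.5 * h1 + (-0.5) * h2

print(y.value, y.tangent)
```

Plain floats (the weights) are promoted to `Dual` with a zero tangent, which matches treating them as constants in this pass.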

## Reverse Mode Automatic Differentiation

Reverse mode automatic differentiation propagates derivatives from outputs to inputs, effectively computing gradients in a backward pass.

Start with $y$ and backpropagate gradients:

$$
\frac{\partial L}{\partial y} = y - t = -0.7 - 5 = -5.7
$$

Backpropagate to hidden layer:

$$
\frac{\partial L}{\partial h_1} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial h_1} = -5.7 \cdot v_1 = -5.7 \cdot 0.5 = -2.85
$$

$$
\frac{\partial L}{\partial h_2} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial h_2} = -5.7 \cdot v_2 = -5.7 \cdot (-0.5) = 2.85
$$

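Backpropagate to the weights. The first-layer weights use the hidden-layer gradients, while the output weights $v_1$ and $v_2$ use $\frac{\partial L}{\partial y}$ directly: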

$$
\frac{\partial L}{\partial w_{11}} = \frac{\partial L}{\partial h_1} \cdot \frac{\partial h_1}{\partial w_{11}} = -2.85 \cdot x_1 = -2.85 \cdot 1 = -2.85
$$

$$
\frac{\partial L}{\partial w_{12}} = \frac{\partial L}{\partial h_1} \cdot \frac{\partial h_1}{\partial w_{12}} = -2.85 \cdot x_2 = -2.85 \cdot 2 = -5.7
$$

$$
\frac{\partial L}{\partial w_{21}} = \frac{\partial L}{\partial h_2} \cdot \frac{\partial h_2}{\partial w_{21}} = 2.85 \cdot x_1 = 2.85 \cdot 1 = 2.85
$$

$$
\frac{\partial L}{\partial w_{22}} = \frac{\partial L}{\partial h_2} \cdot \frac{\partial h_2}{\partial w_{22}} = 2.85 \cdot x_2 = 2.85 \cdot 2 = 5.7
$$

$$
\frac{\partial L}{\partial v_1} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial v_1} = -5.7 \cdot h_1 = -5.7 \cdot (-0.3) = 1.71
$$

$$
\frac{\partial L}{\partial v_2} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial v_2} = -5.7 \cdot h_2 = -5.7 \cdot 1.1 = -6.27
$$
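The backward pass above can be written out by hand in Python. This is a sketch that mirrors the chain-rule steps for this specific network rather than a general reverse-mode engine; the function name `forward_backward` and the dictionary keys are illustrative assumptions.

```python
def forward_backward(w11, w12, w21, w22, v1, v2, x1, x2, t):
    """Run the forward pass, then propagate dL/dy back to every weight."""
    # Forward pass: keep the intermediates needed by the backward pass.
    h1 = w11 * x1 + w12 * x2
    h2 = w21 * x1 + w22 * x2
    y = v1 * h1 + v2 * h2

    # Backward pass, from the output toward the weights.
    dL_dy = y - t            # L = 0.5 * (y - t)^2
    dL_dh1 = dL_dy * v1
    dL_dh2 = dL_dy * v2

    return {
        "w11": dL_dh1 * x1,
        "w12": dL_dh1 * x2,
        "w21": dL_dh2 * x1,
        "w22": dL_dh2 * x2,
        "v1": dL_dy * h1,
        "v2": dL_dy * h2,
    }


grads = forward_backward(0.1, -0.2, 0.3, 0.4, 0.5, -0.5, x1=1.0, x2=2.0, t=5.0)
for name, value in grads.items():
    print(name, value)
```

A single backward pass yields all six gradients, whereas forward mode would need one pass per weight.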
