In [1]:
import numpy as np

## 8.1.2 General Feed-Forward Networks

**Single-Layer Pre-Activation Expression**
$$a_j = \sum_i w_{ji} z_i$$
This expresses a single pre-activation value $a_j$ as a weighted sum over the inputs $z_i$, $i \in \{1, \dots, n\}$, with weights $w_{ji}$. The subscript $ji$ indicates that unit $j$ has one weight per input $i$, and that these weights are associated only with the $j^{th}$ pre-activation. When a layer is written as a weight matrix, these weighted sums follow directly from matrix-vector multiplication: each $a_j$ is the dot product of row $j$ of the weight matrix with the input vector. Note also that the number of inputs (the range of $i$) need not equal the number of units (the range of $j$); that is, the dimension of the domain need not equal the dimension of the range.

The inputs $z_i$ may either be input data or activations from a previous layer. For convenience, we will assume that $z_j$ denotes an activation going forward:
$$z_j = h(a_j)$$
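As a minimal sketch of the two expressions above, the weighted sum can be written both as an explicit loop and as a matrix-vector product. The layer sizes, the random values, and the choice $h = \tanh$ are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs, n_units = 4, 3                   # illustrative sizes (assumed)
W = rng.normal(size=(n_units, n_inputs))   # W[j, i] plays the role of w_ji
z_in = rng.normal(size=n_inputs)           # inputs z_i (data or previous activations)

# Pre-activations a_j = sum_i w_ji z_i, written as an explicit double loop ...
a_loop = np.array([
    sum(W[j, i] * z_in[i] for i in range(n_inputs))
    for j in range(n_units)
])

# ... and as a single matrix-vector product.
a = W @ z_in

# Activations z_j = h(a_j); tanh is an assumed choice of h.
z_out = np.tanh(a)
```

Note that `W` has shape `(n_units, n_inputs)`, so the input and output dimensions are free to differ.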

Denoting the error function (loss function) for a specific input datum $x_n$ as $E_n$, the partial derivative of $E_n$ w.r.t. any weight $w_{ji}$ is readily obtained via the chain rule as:
$$\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}}$$
We denote ***errors*** as: $$\delta_j \coloneqq \frac{\partial E_n}{\partial a_j}$$

These are called errors because, when the loss function is the squared-error (MSE) loss, the partial derivative of the loss w.r.t. the final activation is simply the estimation error (prediction minus target). This is the same reason negative gradients are called *pseudo-residuals* in gradient boosting machines.

Assuming that the final activation function $h_j(\cdot)$ is the identity link (the canonical link function of the MSE loss), then $z_j \equiv a_j$.

Moving on:
$$\frac{\partial a_j}{\partial w_{ji}} = z_i$$

Then, $$\frac{\partial E_n}{\partial w_{ji}} = \delta_j z_i$$

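For a hidden unit $j$, applying the chain rule through every unit $k$ that receives a connection from $j$ gives $\delta_j = \sum_k \frac{\partial E_n}{\partial a_k}\frac{\partial a_k}{\partial a_j}$, where $\frac{\partial a_k}{\partial a_j} = w_{kj} h'(a_j)$. We may wrap this up into a **"Back-Propagation Formula"**: $$\delta_j = h'(a_j) \sum_k w_{kj}\delta_k$$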

**NOTE:**\
The derivative $\partial E_n / \partial w_{ji} = \delta_j z_i$ combines two quantities: the activation $z_i$ feeding into the weight, which is available from the forward pass, and the error $\delta_j$ of the unit the weight feeds, which depends only on the errors $\delta_k$ of the units that come *after* it in the network. This allows us to compute the partial derivatives iteratively, moving backwards through the network from the outputs.
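The following is a minimal sketch of these formulas for a two-layer network, checked against central finite differences. The architecture (tanh hidden units, identity outputs, no biases), the sum-of-squares error $E_n = \frac{1}{2}\sum_k (y_k - t_k)^2$, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def forward(W1, W2, x):
    """Forward pass: tanh hidden layer, identity output layer (assumed)."""
    a1 = W1 @ x          # hidden pre-activations a_j
    z1 = np.tanh(a1)     # hidden activations z_j = h(a_j)
    y = W2 @ z1          # outputs; identity link, so y_k = a_k
    return z1, y

def loss(W1, W2, x, t):
    """Sum-of-squares error for a single datum."""
    _, y = forward(W1, W2, x)
    return 0.5 * np.sum((y - t) ** 2)

def backprop(W1, W2, x, t):
    """Gradients of the loss w.r.t. W1 and W2 via the delta recursion."""
    z1, y = forward(W1, W2, x)
    delta2 = y - t                              # output errors delta_k
    delta1 = (1.0 - z1 ** 2) * (W2.T @ delta2)  # h'(a_j) * sum_k w_kj delta_k
    # dE/dw_ji = delta_j * z_i, i.e. an outer product per layer.
    return np.outer(delta1, x), np.outer(delta2, z1)

def numerical_grad(W1, W2, x, t, which, eps=1e-6):
    """Central-difference gradient w.r.t. W1 (which=0) or W2 (which=1)."""
    W = (W1, W2)[which]
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        original = W[idx]
        W[idx] = original + eps
        e_plus = loss(W1, W2, x, t)
        W[idx] = original - eps
        e_minus = loss(W1, W2, x, t)
        W[idx] = original                       # restore the weight
        grad[idx] = (e_plus - e_minus) / (2.0 * eps)
    return grad

rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 3))
W2 = rng.normal(size=(2, 5))
x = rng.normal(size=3)
t = rng.normal(size=2)

g1, g2 = backprop(W1, W2, x, t)
n1 = numerical_grad(W1, W2, x, t, which=0)
n2 = numerical_grad(W1, W2, x, t, which=1)
print(np.max(np.abs(g1 - n1)), np.max(np.abs(g2 - n2)))
```

The printed discrepancies should be small, reflecting only finite-difference and rounding error. The output errors `delta2` are computed first and then reused for `delta1`, which is the backwards iteration described in the note.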

### Computational Cost

How does the computational complexity of backprop scale with the number of network parameters $W$?

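Each weight contributes one multiply-add to a forward-pass sum, at most one multiply-add to a back-propagated sum for $\delta_j$, and one product $\delta_j z_i$ for its own derivative. The activation functions and their derivatives are evaluated once per unit, and since we may increase $W$ (the number of weights and biases) while holding the number of layers (and activation functions) fixed, this per-unit overhead does not dominate. The computational cost of backprop for a single input is therefore $O(W)$.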
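As a rough sketch of this scaling, the snippet below counts weight-level operations for a fully connected network. The helper `backprop_op_counts` is hypothetical, and the count ignores biases and activation-function evaluations (both assumptions made for simplicity).

```python
def backprop_op_counts(layer_sizes):
    """Return (number of weights, weight-level operations for one backprop pass).

    Hypothetical helper: assumes fully connected layers and no biases.
    """
    pairs = list(zip(layer_sizes[:-1], layer_sizes[1:]))
    n_weights = sum(n_in * n_out for n_in, n_out in pairs)

    # Forward pass: one multiply-add per weight in a_j = sum_i w_ji z_i.
    forward_ops = n_weights
    # Backward pass: deltas are propagated through every layer except the
    # first, one multiply-add per weight in sum_k w_kj delta_k.
    backward_ops = sum(n_in * n_out for n_in, n_out in pairs[1:])
    # Gradients: one product delta_j * z_i per weight.
    gradient_ops = n_weights

    return n_weights, forward_ops + backward_ops + gradient_ops

counts = backprop_op_counts([10, 20, 5])   # -> (300, 700)

# Widening the hidden layer grows W, but operations per weight stay below 3.
ratios = []
for hidden in (20, 200, 2000):
    n_w, n_ops = backprop_op_counts([10, hidden, 5])
    ratios.append(n_ops / n_w)
```

Under these assumptions the operation count is bounded by $3W$, which is consistent with the $O(W)$ claim.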

## 8.1.5 The Jacobian Matrix

The Jacobian matrix is defined here as the matrix of partial derivatives of the output $y$ w.r.t. the input $x$, such that the elements of the Jacobian are given by: $$J_{ki} \coloneqq \frac{\partial y_k}{\partial x_i}$$

When the input is an activation, the Jacobian w.r.t. the activation has elements: $$J_{kj} \coloneqq \frac{\partial y_k}{\partial z_j}$$
This allows us to express:
$$\frac{\partial E}{\partial w} = \sum_{k,j} \frac{\partial E}{\partial y_k} J_{kj} \frac{\partial z_j}{\partial w}$$

Likewise, we may express the Jacobian for a layer as: $$J_{ki} = \frac{\partial y_k}{\partial z_i} = \sum_j \frac{\partial y_k}{\partial a_j}\frac{\partial a_j}{\partial z_i} = \sum_j w_{ji} \frac{\partial y_k}{\partial a_j}$$

And recursively, where the sum runs over all units $l$ that receive a connection from unit $j$:
$$ \frac{\partial y_k}{\partial a_j} = \sum_l \frac{\partial y_k}{\partial a_l} \frac{\partial a_l}{\partial a_j} = h'(a_j) \sum_l w_{lj} \frac{\partial y_k}{\partial a_l}$$
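A minimal sketch of this backward recursion for a two-layer network, compared with a finite-difference Jacobian, is given below. The architecture (tanh hidden units, identity outputs, no biases) and all names are assumptions for illustration; the recursion uses the weights $w_{lj}$ leaving unit $j$.

```python
import numpy as np

def net(W1, W2, x):
    """Two-layer network: tanh hidden layer, identity outputs (assumed)."""
    return W2 @ np.tanh(W1 @ x)

def jacobian_backprop(W1, W2, x):
    """J[k, i] = dy_k/dx_i via the backward recursion on dy_k/da_j."""
    z1 = np.tanh(W1 @ x)
    # Start at the outputs: dy_k/da_l is the identity for identity outputs.
    dy_da2 = np.eye(W2.shape[0])
    # Recursion: dy_k/da_j = h'(a_j) * sum_l w_lj * dy_k/da_l.
    dy_da1 = (dy_da2 @ W2) * (1.0 - z1 ** 2)
    # Final step: J_ki = sum_j w_ji * dy_k/da_j.
    return dy_da1 @ W1

def jacobian_numerical(W1, W2, x, eps=1e-6):
    """Central-difference Jacobian, one input dimension at a time."""
    J = np.zeros((W2.shape[0], x.size))
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        J[:, i] = (net(W1, W2, x + step) - net(W1, W2, x - step)) / (2.0 * eps)
    return J

rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 3))
W2 = rng.normal(size=(2, 5))
x = rng.normal(size=3)

J = jacobian_backprop(W1, W2, x)
J_num = jacobian_numerical(W1, W2, x)
print(np.max(np.abs(J - J_num)))
```

The Jacobian has one row per output $y_k$ and one column per input $x_i$, so here it has shape `(2, 3)`.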

# 8.2 Automatic Differentiation