This repository implements a simple two-layer neural network in pure NumPy. Doing so requires a working command of the chain rule, especially in matrix notation. This README first walks through the derivation from a theoretical perspective and then gives its NumPy implementation.
Before reading the derivation, install the GitHub with MathJax extension for Chrome so that the formulas render, then refresh this page.
Suppose we have a two-layer neural network with a single ReLU hidden layer. Then we can obtain its forward pass:

$$u = xW, \qquad h = \max(u, 0), \qquad y = hS,$$

where $x \in \mathbb{R}^{B \times n_x}$ is a batch of inputs, $W \in \mathbb{R}^{n_x \times n_h}$ and $S \in \mathbb{R}^{n_h \times n_y}$ are the weight matrices, and the loss against the targets $t \in \mathbb{R}^{B \times n_y}$ is the sum of squared errors

$$L = \sum_{i,j} (y_{ij} - t_{ij})^2.$$

Now, the objective is to derive the gradients with respect to $S$ or $W$, i.e., $\frac{\partial L}{\partial S}$ and $\frac{\partial L}{\partial W}$. Then, by the chain rule,

$$\frac{\partial L}{\partial y} = 2\,(y - t), \qquad \frac{\partial L}{\partial S} = h^\top \frac{\partial L}{\partial y}.$$

For the gradient w.r.t. $W$, we first backpropagate through $S$ and through the ReLU:

$$\frac{\partial L}{\partial h} = \frac{\partial L}{\partial y}\,S^\top, \qquad \frac{\partial L}{\partial u} = \frac{\partial L}{\partial h} \odot \mathbf{1}[u > 0].$$

Therefore,

$$\frac{\partial L}{\partial W} = x^\top \frac{\partial L}{\partial u}.$$
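To make the last step concrete, here is a sketch of the index bookkeeping (the indices $b$, $p$, $q$ are introduced here for illustration: $b$ runs over the batch, $p, q$ over the entries of $W$). Since $u_{bq} = \sum_p x_{bp} W_{pq}$,

$$\frac{\partial L}{\partial W_{pq}}
= \sum_{b} \frac{\partial L}{\partial u_{bq}} \, \frac{\partial u_{bq}}{\partial W_{pq}}
= \sum_{b} x_{bp} \, \frac{\partial L}{\partial u_{bq}}
= \Big( x^\top \frac{\partial L}{\partial u} \Big)_{pq},$$

so the per-entry sums collapse into an ordinary matrix product with $x^\top$.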
The complete NumPy implementation is as follows:

```python
import numpy as np

batch_size, n_x, n_h, n_y = 64, 1000, 100, 10
x = np.random.randn(batch_size, n_x)   # input batch
t = np.random.randn(batch_size, n_y)   # targets
W = np.random.randn(n_x, n_h)          # first-layer weights
S = np.random.randn(n_h, n_y)          # second-layer weights
learning_rate = 1e-6

for i in range(500):
    # Forward pass: compute predicted y
    u = x.dot(W)
    h = np.maximum(u, 0)
    y = h.dot(S)

    # Compute and print loss
    loss = np.square(y - t).sum()
    print("Iter: {}\tLoss: {}".format(i, loss))

    # Backprop to compute gradients of the loss with respect to W and S
    grad_y = 2.0 * (y - t)
    grad_S = h.T.dot(grad_y)
    grad_h = grad_y.dot(S.T)
    grad_u = grad_h.copy()
    grad_u[u < 0] = 0
    grad_W = x.T.dot(grad_u)

    # Update weights by gradient descent
    W -= learning_rate * grad_W
    S -= learning_rate * grad_S
```
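The derived gradients can also be verified numerically. Below is a minimal sketch of a central-difference check for a single entry of `S`; the helper `loss_fn`, the small shapes, and the perturbation size `eps` are my own additions for illustration, not part of the repository:

```python
import numpy as np

np.random.seed(0)
B, n_x, n_h, n_y = 4, 5, 6, 3
x = np.random.randn(B, n_x)
t = np.random.randn(B, n_y)
W = np.random.randn(n_x, n_h)
S = np.random.randn(n_h, n_y)

def loss_fn(W, S):
    # Same forward pass and loss as in the training loop above.
    h = np.maximum(x.dot(W), 0)
    y = h.dot(S)
    return np.square(y - t).sum()

# Analytic gradient w.r.t. S, as derived: grad_S = h^T (2 (y - t))
h = np.maximum(x.dot(W), 0)
y = h.dot(S)
grad_S = h.T.dot(2.0 * (y - t))

# Central-difference estimate for one entry of S
eps = 1e-6
i, j = 2, 1
S_plus, S_minus = S.copy(), S.copy()
S_plus[i, j] += eps
S_minus[i, j] -= eps
numeric = (loss_fn(W, S_plus) - loss_fn(W, S_minus)) / (2 * eps)

print(grad_S[i, j], numeric)   # the two values should agree closely
```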
This used to confuse me: I expected the whole derivation to involve tensor products. Surprisingly, however, it turns out to be just an application of the fourth definition of matrix product.
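Assuming the "fourth definition" refers to the column-times-row (sum of outer products) view of matrix multiplication, the sketch below illustrates it for the gradient of `S`: the single matrix product `h.T.dot(grad_y)` equals the sum of one outer product per example in the batch.

```python
import numpy as np

np.random.seed(0)
B, n_h, n_y = 4, 6, 3
h = np.random.randn(B, n_h)
grad_y = np.random.randn(B, n_y)

# Matrix-product form used in the training loop
grad_S = h.T.dot(grad_y)

# Column-times-row view: sum of per-example outer products
grad_S_outer = sum(np.outer(h[b], grad_y[b]) for b in range(B))

print(np.allclose(grad_S, grad_S_outer))   # True
```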