<a href="https://colab.research.google.com/github/HiroTakeda/Notes/blob/main/NeuralNetwork.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Neural Network

+ Related topics: __regression__, __logistic regression__

A neural network is a universal function approximator. This note shows how to implement the __linear layer__ (also known as a __dense layer__ or __fully connected layer__), one of the basic components of the neural network approach, which is applicable to both __regression__ and __logistic regression__ problems.


## 2-Layer Neural Network
![picture](https://drive.google.com/uc?id=1_p7BO63LIfo9X8K52RkCuZw68A6_MIZt)
### Feed-forward network functions

The 1st layer, where $h(\cdot)$ is the activation function

$$
\begin{eqnarray}
a_1 &=& w_{11}^{(1)} x_1 + w_{12}^{(1)} x_2 + \beta_1^{(1)}, \quad z_1 = h(a_1) \nonumber \\
a_2 &=& w_{21}^{(1)} x_1 + w_{22}^{(1)} x_2 + \beta_2^{(1)}, \quad z_2 = h(a_2) \nonumber
\end{eqnarray}
$$

The 2nd layer

$$
\begin{eqnarray}
b_1 &=& w_{11}^{(2)} z_1 + w_{12}^{(2)} z_2 + \beta_1^{(2)}, \quad y_1 = h(b_1) \nonumber \\
b_2 &=& w_{21}^{(2)} z_1 + w_{22}^{(2)} z_2 + \beta_2^{(2)}, \quad y_2 = h(b_2) \nonumber
\end{eqnarray}
$$
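The two layers above can be sketched in plain Python as follows. This is a minimal illustration, not a definitive implementation: the note does not specify the activation $h$, so the logistic sigmoid is assumed here, and the function names (`sigmoid`, `forward`) and all parameter values are hypothetical.

```python
import math

def sigmoid(a):
    """Logistic sigmoid, an assumed choice for the activation h."""
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, W1, beta1, W2, beta2, h=sigmoid):
    """Forward pass of the 2-layer network; returns (a, z, b, y)."""
    # 1st layer: a_i = sum_j w_ij^(1) x_j + beta_i^(1),  z_i = h(a_i)
    a = [sum(w * xj for w, xj in zip(row, x)) + c for row, c in zip(W1, beta1)]
    z = [h(ai) for ai in a]
    # 2nd layer: b_k = sum_j w_kj^(2) z_j + beta_k^(2),  y_k = h(b_k)
    b = [sum(w * zj for w, zj in zip(row, z)) + c for row, c in zip(W2, beta2)]
    y = [h(bk) for bk in b]
    return a, z, b, y

# Hypothetical parameter values, chosen only for illustration.
W1 = [[0.1, -0.2], [0.4, 0.3]]   # W1[i][j] plays the role of w_{ij}^{(1)}
beta1 = [0.0, 0.1]
W2 = [[0.5, -0.5], [0.2, 0.7]]   # W2[k][j] plays the role of w_{kj}^{(2)}
beta2 = [0.1, -0.1]

a, z, b, y = forward([1.0, 2.0], W1, beta1, W2, beta2)
print(y)
```

The intermediate values `a`, `z`, and `b` are returned along with `y` because the gradient computation needs them later.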




### Loss function

Given a set of training data $(\mathbf{x}, \mathbf{t})_n = ([x_1, x_2]^T, [t_1, t_2]^T)_n$ for $n = 1, \ldots, N$, we'd like to find the weights and bias values for each layer. The number of samples $N$ is often very large, in which case the stochastic gradient descent method is usually more efficient than plain gradient descent: instead of evaluating all $N$ samples for every update, it evaluates one sample at a time. Let us use one of the basic loss functions here, the mean squared error. For one sample, we have

$$
E = \displaystyle\frac{1}{2} (y_1 - t_1)^2 + \displaystyle\frac{1}{2} (y_2 - t_2)^2 = \displaystyle\frac{1}{2} \displaystyle\sum_{k} (y_k - t_k)^2
$$
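As a small sketch, the per-sample loss can be written as a short Python function. The name `squared_error` and the sample values are hypothetical.

```python
def squared_error(y, t):
    """Per-sample loss E = 1/2 * sum_k (y_k - t_k)^2."""
    return 0.5 * sum((yk - tk) ** 2 for yk, tk in zip(y, t))

# Hypothetical network outputs and targets for one sample.
E = squared_error([0.8, 0.3], [1.0, 0.0])
print(round(E, 6))  # 0.5 * (0.04 + 0.09) = 0.065
```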

### Stochastic gradient descent

The stochastic gradient descent method finds the weights and bias values by updating them iteratively using one sample at a time (in practice, a small batch of samples is usually used):

$$
w_{ij}^{(\cdot)} \Leftarrow w_{ij}^{(\cdot)} - \eta \displaystyle\frac{\partial E}{\partial w_{ij}^{(\cdot)}}, \quad \beta_{i}^{(\cdot)} \Leftarrow \beta_{i}^{(\cdot)} - \eta \displaystyle\frac{\partial E}{\partial \beta_{i}^{(\cdot)}}
$$

where $\eta$ is a step size.
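A minimal sketch of one update step, assuming the usual descent convention in which each parameter moves against its gradient so that $E$ decreases. The helper `sgd_step` and the one-parameter toy loss are hypothetical and only illustrate the rule.

```python
def sgd_step(params, grads, eta):
    """Return the parameters after one step against the gradient."""
    return [p - eta * g for p, g in zip(params, grads)]

# Toy problem (hypothetical): E(w) = 0.5 * (w - 3)^2, so dE/dw = w - 3.
w = [0.0]
for _ in range(200):
    grad = [w[0] - 3.0]
    w = sgd_step(w, grad, eta=0.1)

print(w[0])  # approaches the minimizer w = 3
```

With a step size that is too large the iteration can overshoot and diverge, which is why $\eta$ is usually kept small.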

### Error backpropagation

The linear layers are cascaded, and the outputs of one layer are passed to the following layer. To compute the gradients with respect to the weights and bias values, we can use the error backpropagation technique, in which the gradient for each parameter is computed by the chain rule.


---


The 2nd layer, writing $\delta_k = y_k - t_k$ for the error of output $k$

$$
\begin{eqnarray}
\displaystyle\frac{\partial E}{\partial w_{11}^{(2)}} &=& \displaystyle\frac{\partial E}{\partial y_1} \displaystyle\frac{\partial y_1}{\partial w_{11}^{(2)}} = \displaystyle\frac{\partial E}{\partial y_1} \displaystyle\frac{\partial y_1}{\partial b_1} \displaystyle\frac{\partial b_1}{\partial w_{11}^{(2)}} = (y_1 - t_1) \, h'\!(b_1) \, z_1 = \delta_1 \, h'\!(b_1) \, z_1, \nonumber \\
& & \nonumber \\
\displaystyle\frac{\partial E}{\partial w_{12}^{(2)}} &=& \displaystyle\frac{\partial E}{\partial y_1} \displaystyle\frac{\partial y_1}{\partial w_{12}^{(2)}} = \displaystyle\frac{\partial E}{\partial y_1} \displaystyle\frac{\partial y_1}{\partial b_1} \displaystyle\frac{\partial b_1}{\partial w_{12}^{(2)}} = (y_1 - t_1) \, h'\!(b_1) \, z_2 = \delta_1 \, h'\!(b_1) \, z_2, \nonumber \\
& & \nonumber \\
\displaystyle\frac{\partial E}{\partial \beta_{1}^{(2)}} &=& \displaystyle\frac{\partial E}{\partial y_1} \displaystyle\frac{\partial y_1}{\partial \beta_{1}^{(2)}} = \displaystyle\frac{\partial E}{\partial y_1} \displaystyle\frac{\partial y_1}{\partial b_1} \displaystyle\frac{\partial b_1}{\partial \beta_{1}^{(2)}} = (y_1 - t_1) \, h'\!(b_1) = \delta_1 \, h'\!(b_1), \nonumber \\
& & \nonumber \\
\displaystyle\frac{\partial E}{\partial w_{21}^{(2)}} &=& \displaystyle\frac{\partial E}{\partial y_2} \displaystyle\frac{\partial y_2}{\partial w_{21}^{(2)}} = \displaystyle\frac{\partial E}{\partial y_2} \displaystyle\frac{\partial y_2}{\partial b_2} \displaystyle\frac{\partial b_2}{\partial w_{21}^{(2)}} = (y_2 - t_2) \, h'\!(b_2) \, z_1 = \delta_2 \, h'\!(b_2) \, z_1, \nonumber \\
& & \nonumber \\
\displaystyle\frac{\partial E}{\partial w_{22}^{(2)}} &=& \displaystyle\frac{\partial E}{\partial y_2} \displaystyle\frac{\partial y_2}{\partial w_{22}^{(2)}} = \displaystyle\frac{\partial E}{\partial y_2} \displaystyle\frac{\partial y_2}{\partial b_2} \displaystyle\frac{\partial b_2}{\partial w_{22}^{(2)}} = (y_2 - t_2) \, h'\!(b_2) \, z_2 = \delta_2 \, h'\!(b_2) \, z_2, \nonumber \\
& & \nonumber \\
\displaystyle\frac{\partial E}{\partial \beta_{2}^{(2)}} &=& \displaystyle\frac{\partial E}{\partial y_2} \displaystyle\frac{\partial y_2}{\partial \beta_{2}^{(2)}} = \displaystyle\frac{\partial E}{\partial y_2} \displaystyle\frac{\partial y_2}{\partial b_2} \displaystyle\frac{\partial b_2}{\partial \beta_{2}^{(2)}} = (y_2 - t_2) \, h'\!(b_2) = \delta_2 \, h'\!(b_2). \nonumber
\end{eqnarray}
$$
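The 2nd-layer gradients can be checked numerically. The sketch below computes $(y_k - t_k)\, h'(b_k)\, z_j$ for each weight, where output $k$ uses the derivative at its own pre-activation $b_k$, and compares one entry against a central finite difference. The sigmoid activation, the function names, and all values are assumptions made for illustration.

```python
import math

def sigmoid(a):
    """Assumed activation h."""
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_prime(a):
    """Derivative h'(a) of the sigmoid."""
    s = sigmoid(a)
    return s * (1.0 - s)

def layer2_loss(z, W2, beta2, t):
    """Return (E, b, y) for the 2nd layer given hidden outputs z."""
    b = [sum(w * zj for w, zj in zip(row, z)) + c for row, c in zip(W2, beta2)]
    y = [sigmoid(bk) for bk in b]
    E = 0.5 * sum((yk - tk) ** 2 for yk, tk in zip(y, t))
    return E, b, y

def layer2_grads(z, W2, beta2, t):
    """Analytic gradients of E w.r.t. the 2nd-layer weights and biases."""
    _, b, y = layer2_loss(z, W2, beta2, t)
    # g_k = (y_k - t_k) * h'(b_k)
    g = [(yk - tk) * sigmoid_prime(bk) for yk, tk, bk in zip(y, t, b)]
    dW2 = [[gk * zj for zj in z] for gk in g]  # dE/dw_kj^(2) = g_k * z_j
    dbeta2 = list(g)                           # dE/dbeta_k^(2) = g_k
    return dW2, dbeta2

# Hypothetical values for illustration.
z = [0.3, 0.8]
W2 = [[0.5, -0.5], [0.2, 0.7]]
beta2 = [0.1, -0.1]
t = [1.0, 0.0]
dW2, dbeta2 = layer2_grads(z, W2, beta2, t)

# Central finite difference for w_21^(2), stored at W2[1][0].
eps = 1e-6
Wp = [row[:] for row in W2]
Wm = [row[:] for row in W2]
Wp[1][0] += eps
Wm[1][0] -= eps
numeric = (layer2_loss(z, Wp, beta2, t)[0] - layer2_loss(z, Wm, beta2, t)[0]) / (2 * eps)
print(dW2[1][0], numeric)  # the two values should agree closely
```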



---


The 1st layer

$$
\displaystyle\frac{\partial E}{\partial w_{11}^{(1)}} = \displaystyle\frac{\partial E}{\partial a_1} \displaystyle\frac{\partial a_1}{\partial w_{11}^{(1)}} = \left\{ \delta_1\, h'\!(b_1) \, w_{11}^{(2)} + \delta_2\, h'\!(b_2) \, w_{21}^{(2)} \right\} h'\!(a_1)\, x_1
$$

where

$$
\begin{eqnarray}
\displaystyle\frac{\partial E}{\partial a_1} &=& \displaystyle\frac{\partial E}{\partial b_1} \displaystyle\frac{\partial b_1}{\partial a_1} + \displaystyle\frac{\partial E}{\partial b_2} \displaystyle\frac{\partial b_2}{\partial a_1} \nonumber \\
& & \nonumber \\
&=& \displaystyle\frac{\partial E}{\partial y_1} \displaystyle\frac{\partial y_1}{\partial b_1} \displaystyle\frac{\partial b_1}{\partial a_1} + \displaystyle\frac{\partial E}{\partial y_2} \displaystyle\frac{\partial y_2}{\partial b_2} \displaystyle\frac{\partial b_2}{\partial a_1} \nonumber \\
& & \nonumber \\
&=& \delta_1 \, h'\!(b_1)\, \displaystyle\frac{\partial b_1}{\partial a_1} + \delta_2 \, h'\!(b_2)\, \displaystyle\frac{\partial b_2}{\partial a_1} \nonumber
\end{eqnarray}
$$

and

$$
\begin{eqnarray}
\displaystyle\frac{\partial b_1}{\partial a_1} &=& \displaystyle\frac{\partial b_1}{\partial z_1} \displaystyle\frac{\partial z_1}{\partial a_1} + \displaystyle\frac{\partial b_1}{\partial z_2} \displaystyle\frac{\partial z_2}{\partial a_1} = w_{11}^{(2)} \, h'\!(a_1) + w_{12}^{(2)} \cdot 0 = w_{11}^{(2)} \, h'\!(a_1), \nonumber \\
& & \nonumber \\
\displaystyle\frac{\partial b_2}{\partial a_1} &=& \displaystyle\frac{\partial b_2}{\partial z_1} \displaystyle\frac{\partial z_1}{\partial a_1} + \displaystyle\frac{\partial b_2}{\partial z_2} \displaystyle\frac{\partial z_2}{\partial a_1} = w_{21}^{(2)} \, h'\!(a_1) + w_{22}^{(2)} \cdot 0 = w_{21}^{(2)} \, h'\!(a_1). \nonumber
\end{eqnarray}
$$

Similarly, for all the weights and bias values of the 1st layer, we have

$$
\begin{eqnarray}
\displaystyle\frac{\partial E}{\partial w_{11}^{(1)}} &=& \displaystyle\frac{\partial E}{\partial a_1} \displaystyle\frac{\partial a_1}{\partial w_{11}^{(1)}} = \left\{ \delta_1\, h'\!(b_1) \, w_{11}^{(2)} + \delta_2\, h'\!(b_2) \, w_{21}^{(2)} \right\} h'\!(a_1)\, x_1, \nonumber \\
& & \nonumber \\
\displaystyle\frac{\partial E}{\partial w_{12}^{(1)}} &=& \displaystyle\frac{\partial E}{\partial a_1} \displaystyle\frac{\partial a_1}{\partial w_{12}^{(1)}} = \left\{ \delta_1\, h'\!(b_1) \, w_{11}^{(2)} + \delta_2\, h'\!(b_2) \, w_{21}^{(2)} \right\} h'\!(a_1)\, x_2, \nonumber \\
& & \nonumber \\
\displaystyle\frac{\partial E}{\partial \beta_1^{(1)}} &=& \displaystyle\frac{\partial E}{\partial a_1} \displaystyle\frac{\partial a_1}{\partial \beta_1^{(1)}} = \left\{ \delta_1\, h'\!(b_1) \, w_{11}^{(2)} + \delta_2\, h'\!(b_2) \, w_{21}^{(2)} \right\} h'\!(a_1), \nonumber \\
& & \nonumber \\
\displaystyle\frac{\partial E}{\partial w_{21}^{(1)}} &=& \displaystyle\frac{\partial E}{\partial a_2} \displaystyle\frac{\partial a_2}{\partial w_{21}^{(1)}} = \left\{ \delta_1\, h'\!(b_1) \, w_{12}^{(2)} + \delta_2\, h'\!(b_2) \, w_{22}^{(2)} \right\} h'\!(a_2)\, x_1, \nonumber \\
& & \nonumber \\
\displaystyle\frac{\partial E}{\partial w_{22}^{(1)}} &=& \displaystyle\frac{\partial E}{\partial a_2} \displaystyle\frac{\partial a_2}{\partial w_{22}^{(1)}} = \left\{ \delta_1\, h'\!(b_1) \, w_{12}^{(2)} + \delta_2\, h'\!(b_2) \, w_{22}^{(2)} \right\} h'\!(a_2)\, x_2, \nonumber \\
& & \nonumber \\
\displaystyle\frac{\partial E}{\partial \beta_2^{(1)}} &=& \displaystyle\frac{\partial E}{\partial a_2} \displaystyle\frac{\partial a_2}{\partial \beta_2^{(1)}} = \left\{ \delta_1\, h'\!(b_1) \, w_{12}^{(2)} + \delta_2\, h'\!(b_2) \, w_{22}^{(2)} \right\} h'\!(a_2). \nonumber
\end{eqnarray}
$$
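Putting both layers together, the sketch below computes all gradients by backpropagation and checks one 1st-layer entry against a central finite difference. It is an illustration under assumptions, not a definitive implementation: the activation `h` is taken to be the logistic sigmoid, and the function names and numerical values are hypothetical.

```python
import math

def h(a):
    """Assumed activation: logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-a))

def h_prime(a):
    """Derivative of the assumed activation."""
    s = h(a)
    return s * (1.0 - s)

def forward(x, W1, beta1, W2, beta2):
    """Forward pass; returns pre-activations and outputs of both layers."""
    a = [sum(w * xj for w, xj in zip(row, x)) + c for row, c in zip(W1, beta1)]
    z = [h(ai) for ai in a]
    b = [sum(w * zj for w, zj in zip(row, z)) + c for row, c in zip(W2, beta2)]
    y = [h(bk) for bk in b]
    return a, z, b, y

def loss(x, t, W1, beta1, W2, beta2):
    """Per-sample loss E = 1/2 * sum_k (y_k - t_k)^2."""
    y = forward(x, W1, beta1, W2, beta2)[3]
    return 0.5 * sum((yk - tk) ** 2 for yk, tk in zip(y, t))

def backprop(x, t, W1, beta1, W2, beta2):
    """Return (dW1, dbeta1, dW2, dbeta2) via the chain rule."""
    a, z, b, y = forward(x, W1, beta1, W2, beta2)
    # dE/db_k = (y_k - t_k) * h'(b_k)
    g2 = [(yk - tk) * h_prime(bk) for yk, tk, bk in zip(y, t, b)]
    # dE/da_i = { sum_k dE/db_k * w_ki^(2) } * h'(a_i)
    g1 = [sum(g2[k] * W2[k][i] for k in range(len(g2))) * h_prime(a[i])
          for i in range(len(a))]
    dW2 = [[gk * zj for zj in z] for gk in g2]
    dW1 = [[gi * xj for xj in x] for gi in g1]
    return dW1, g1, dW2, g2

# Hypothetical sample and parameters.
x, t = [1.0, 2.0], [1.0, 0.0]
W1, beta1 = [[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1]
W2, beta2 = [[0.5, -0.5], [0.2, 0.7]], [0.1, -0.1]
dW1, dbeta1, dW2, dbeta2 = backprop(x, t, W1, beta1, W2, beta2)

# Central finite difference for w_21^(1), stored at W1[1][0].
eps = 1e-6
W1p = [row[:] for row in W1]
W1m = [row[:] for row in W1]
W1p[1][0] += eps
W1m[1][0] -= eps
numeric = (loss(x, t, W1p, beta1, W2, beta2)
           - loss(x, t, W1m, beta1, W2, beta2)) / (2 * eps)
print(dW1[1][0], numeric)  # the two values should agree closely
```

The quantities `g2` and `g1` are computed once per layer and reused for every weight and bias in that layer, which is what makes backpropagation cheaper than differentiating each parameter separately.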